DOCUMENT RESUME



ED 078 877                                        LI 004 412

AUTHOR          Fiorello, Marco R.
TITLE           Management and Design Tools for Document Retrieval
                Systems: A Method for Predicting Quantity Output.
INSTITUTION     Rand Corp., Santa Monica, Calif.
PUB DATE        Mar 73
NOTE            221p.; (153 References)
AVAILABLE FROM  The Rand Corporation, 1700 Main St., Santa Monica,
                Calif. 90406 ($5.00)
EDRS PRICE      MF-$0.65 HC-$9.87
DESCRIPTORS     *Design; Information Processing; *Information
                Retrieval; *Information Storage; *Information
                Systems; *Management; Search Strategies

ABSTRACT 

The existing volume and increasing growth rate of
documented information has resulted in numerous efforts to construct
operational Document Storage and Retrieval Systems, as a practical
solution to the demand for information storage and retrieval.
Accompanying the surge to build more and better and bigger Document
Retrieval Systems (DRSs) was the realization that there are few
effective tools for the designers and managers of these systems. The
tasks of design and management of DRSs require tools and performance
measures to aid in the selection of preferred options, and in the
control over the fundamental processes of inquiry analysis, indexing,
retrieval and system growth. A step toward the generation of
operational tools to aid in the design and management tasks is
presented in this report, by the development of a Retrieval Quantity
(Rq) estimate. The Rq estimate is defined as a function of the
inquiry form, search strategy and descriptor-document distribution,
and can be used to predict the quantity output due to system growth,
and aid in the tuning of the indexing and formal inquiry
specification processes. (Author/NH)



ABSTRACT 

The existing volume and increasing growth rate of documented
information has resulted in numerous efforts to construct operational
Document Storage and Retrieval Systems, as a practical solution to the
demand for information storage and retrieval. Accompanying the surge
to build more and better and bigger Document Retrieval Systems (DRSs)
was the realization that there are few effective tools for the
designers and managers of these systems. The tasks of design and
management of DRSs require tools and performance measures to aid in
the selection of preferred options, and in the control over the
fundamental processes of inquiry analysis, indexing, retrieval and
system growth. A step toward the generation of operational tools to
aid in the design and management tasks is presented in this report,
by the development of a Retrieval Quantity (Rq) estimate. The Rq
estimate is defined as a function of the inquiry form, search strategy
and descriptor-document distribution, and can be used to predict the
quantity output of an inquiry, measure the impact on quantity output
due to system growth, and aid in the tuning of the indexing and formal
inquiry specification processes. The definition of the Rq measure is
based on the identification of certain canonical forms which
characterize the underlying principles of DRS indexing and retrieval.
The Rq estimate was tested on an operational DRS, and demonstrated
high prediction accuracy for a variety of typical inquiries. Though
developed on a small DRS, the methodology for determining Rq appears
to hold for a very wide range of system size, subject content and
construction.




ACKNOWLEDGMENTS 

I take this opportunity to acknowledge my debts to my colleagues
at Rand and the Institute of Library Research at the University of
California for the many fruitful conversations and assistance of
various kinds that were so helpful in this task.

To my thesis committee of Dr. C. West Churchman, Dr. Bill Maron, 
and Dr. Franco Nicosia I am most indebted. They have been patient, 
understanding and always cooperative. 

I am grateful to all these friends, and to each my fondest thanks. 






LIST OF TABLES 

Table   Title

3.1     Retrieval Quality Performance Measures
5.1     Institute of Library Research Document Retrieval System
        Characteristics
5.2     Sample System Characteristics
5.3     Comparison of System Characteristics for: The Institute of
        Library Research, and Systems Investigated by Litofsky (90),
        Houston and Wall (68), A. D. Little (1, 2), Wall (143)
5.4     MEZ Parameters and Function Values
5.5     Comparison of y and Factors
5.6     Test Inquiries
5.7     Comparison of Actual Versus Estimated Quantity Output
5.8     Term-Term Co-occurrences Between Terms with Different
        Frequency of Use
5.9     Coefficients of Association Parameter: a
5.10    Comparison of Outputs for Stage 1 and Stage 2 for the Test
        System






LIST OF FIGURES 

Figure  Title

1.1     Estimate of File Size of Books and Periodicals in U.S.
        Colleges
1.2     Estimate of File Size of U.S. Public Libraries
1.3     Estimate of the Number of Technical Literature Abstracts
        Produced in the World
1.4     Growth of Journals and Abstract Journals
1.5     Literature Growth in Economics and Other Professional Fields
1.6     Information Storage and Retrieval Systems
2.1     Functional Description of Document Retrieval Systems
2.2     Hierarchical Systems
2.3     Facet Systems
2.4     Taxonomy of Coordinate Retrieval Systems
2.5     Index Vocabulary Illustration
2.6     Inverted File Illustration
2.7     Equivalent Logical Notation Illustration
2.8     DXT Matrix: Assignment of Terms to Documents
2.9     Inverted TXD Matrix, and Sample Inquiries
2.10    Illustration of Inclusive and Exclusive Retrieval
3.1     Contingency Table Representation of DRS Corpus
        Classification by an Inquiry
4.1     Theoretical Versus Actual Number of Documents Retrieved:
        A. D. Little Model
4.2     Normalized Term Usage Vs. Rank
4.3     Document-Term Association Matrix
4.4     MEZ and Zipf: Term Frequency of Use Vs. Term Rank
        Distribution
5.1     Test System Term Frequency of Use Vs. Term Rank Distribution
5.2     Relationship Between Term-Document Matrix and Term Usage and
        Cumulative Percent Utilization of Thesaurus
5.3     Term Frequency of Use Vs. Rank for Test Sample, Test System
        and Two Cases Analyzed by Litofsky (90)
5.4     Term Frequency of Use Vs. Rank for System Investigated by
        A. D. Little (1, 2)
5.5     Term Usage Vs. Cumulative Utilization of Thesaurus for
        Systems Investigated by Houston and Wall (68)
5.6     Term Usage Vs. Cumulative Utilization of Thesaurus for
        Systems Investigated by Wall (143)
5.7     Term Usage Vs. Cumulative Utilization of Thesaurus for
        Systems Investigated by Litofsky (90)
5.8     Term Usage Vs. Cumulative Utilization of Thesaurus for Test
        System
5.9     Comparison of MEZ Canonical Form with Test System
5.10    Test System Depth of Indexing Density Distribution
5.11    Depth of Indexing Distribution for Systems Investigated by
        Litofsky (90)
5.12    Derivation of the TXT Matrix
5.13    Illustration of a Sparse DXT Matrix
5.14    Theoretical TXT (i,j) Prediction Factor-y
5.15 to 5.30
        Plotted y-Factors for Test Sample for 1 < f(i).f(j) < 32
5.31 to 5.37
        Plotted Cumulative Frequency of Occurrence of the Ratio of
        Actual to Theoretical y's
5.38    Term Co-Occurrence Values for f(i) and f(j)
5.39, 5.40
        Density of Term Co-Occurrence Values for f(i)
5.41    Upper and Lower Bound Limits for the y-Factors for the Test
        System
5.42    Adjusted y-Factors for the Test System
5.43    Theoretical Probability of at Least One Co-occurrence for
        Terms with f(i) = 1 and 1 ≤ f(j) ≤ D
5.44    Cumulative Plot of Quantity Output for Coefficient of
        Association (6) for the Test System
6.1     Theoretical Family of Curves Defining the Lower Bound of the
        Probability of Co-Occurrence of Two Terms with f(i) = 1,
        1 ≤ f(j) ≤ D



TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES

Chapter 1   INFORMATION STORAGE AND RETRIEVAL: BACKGROUND ISSUES
            1.1  Introduction

Chapter 2   COORDINATE INDEX DOCUMENT STORAGE AND RETRIEVAL
            SYSTEMS: A FORMAL DESCRIPTION
            2.1  Document Storage and Retrieval Systems
            2.2  Document Selection: Sizing the Collection
            2.3  Indexing - Document Analysis and Representation
                 2.3.1  Coordinate Indexes
            2.4  The Index File
            2.5  Inquiry Formulations
                 2.5.1  Inquiry Grammar
            2.6  Search Files and Retrieval Process
            2.7  DRS - A Brief Formal Description
            2.8  Retrieval Set Characteristics

Chapter 3   RETRIEVAL QUANTITY AND DRS PERFORMANCE MEASURES
            3.1  Introduction
            3.2  Measures for Evaluation
                 3.2.1  Response Time
                 3.2.2  System Costs
                 3.2.3  System Convenience of Use
                 3.2.4  System Flexibility
                 3.2.5  Retrieval Quality

Chapter 4   RETRIEVAL QUANTITY ESTIMATION: LITERATURE REVIEW AND
            PROPOSED METHODOLOGY
            4.1  Introduction
            4.2  General Critique of Previous Research
            4.3  Proposed Methodology for Developing the Rq Measure
                 4.3.1  Fundamental DRS Relationships
                 4.3.2  Inquiry Definition and Generation
                 4.3.3  Inquiry - Retrieval Quantity Measure
                        Relationship
            4.4  Hypotheses for Retrieval Quantity Estimations

Chapter 5   THE RETRIEVAL QUANTITY MEASURE: EXPERIMENTS AND RESULTS
            5.1  Introduction
            5.2  Setting and Description
                 5.2.1  Experiments and Analysis
            5.3  Document Retrieval Systems - Common Characteristics
                 5.3.1  The Term-Frequency-of-Use Distribution
                 5.3.2  The Term-Frequency-of-Use Canonical Form
                 5.3.3  Depth of Indexing Distribution
                 5.3.4  The Term-Term Co-Occurrence Distribution
            5.4  The Retrieval Quantity Measure
                 5.4.1  Application of the Term Co-occurrence
                        Factor, y, and Determination of Rq
                 5.4.2  Testing the Rq Estimate
            5.5  The Likelihood of Non-Zero Term-Term Co-Occurrences
            5.6  Word Association Coefficients
            5.7  System Growth Impact on Retrieval Quantity

Chapter 6   CONCLUSION AND SYNTHESIS OF FINDINGS
            6.1  Introduction
            6.2  General Conclusions
            6.3  Management and Design Aids

Chapter 7   RECOMMENDATION FOR ADDITIONAL RESEARCH
            7.1  Introduction
            7.2  Corpus Homogeneity and Heterogeneity
            7.3  Distribution of Terms with Common Frequencies of Use
            7.4  The MEZ Canonical Form
            7.5  Depth of Indexing Distribution
            7.6  Higher Order Term Associations
            7.7  Rq Model Extensions
                 7.7.1  Psychological Analogies

BIBLIOGRAPHY

APPENDIX A  Glossary of Terms

APPENDIX B  Institute of Library Research Laboratory DRS
            o  Thesaurus (Sample)
            o  Document Descriptions (Sample)

APPENDIX C  Sample Data Base Characteristics
            o  Term Frequency of Use Listing
            o  Depth of Indexing Listing
            o  Term-Document Matrix in Condensed Array Format

APPENDIX D  Illustrations of Computations to Estimate Retrieval
            Quantity






It is in the nature of the mind to forget and in the
nature of man to worry over his forgetfulness....

Bower 

Chapter 1 

INFORMATION STORAGE AND RETRIEVAL: BACKGROUND ISSUES
1.1 INTRODUCTION 

Man has always employed some means of storing and retrieving
information. In early tribal or closed-society environments, man's
memory was the principal repository of knowledge, the link between
successive generations and between the discovery of new knowledge and
those who would use it. The advent of formal speech and recordable
languages provided the means for accumulation of experience and
knowledge in mediums for transmission, storage and use by others, in a
relatively time-independent sense (93). As the scope and content of
information became more voluminous and complex, formal systems were
constructed for information storage and retrieval.

This report is concerned with certain underlying principles that
characterize a large class of formal information storage and retrieval
systems. Throughout the discussion that follows, at the risk of
terminological monotony, the term information will be used
continuously to describe what "it" is that information storage and
retrieval systems store and retrieve. No definition of information is
given, principally because there is no generally accepted precise
definition available. Descriptively, information has been labeled: the
essential ingredient of conversation, writing and thought; recorded
experience essential for decisionmaking; the essential link between
means and ends; a resource; meaningful data; the result of a process
on data; and a symbol or signal that a system can employ to guide or
control its functions (6, 26, 27, 149). Information, however, is not
considered to be knowledge, per se, or communication. On the other
hand, knowledge is thought of as an organized body of information, and
communication is viewed as information transfer. The notions become
even more confounded when one considers the additional (though fuzzy)
distinctions between data and information, data and knowledge, and so
on.

Suffice it to say that the entities (data, information, and
knowledge) are different, relative in place and time, and that the
basis of distinction is in part rigorously quantitative (viz.,
Information Theory (145)) and in part qualitative (i.e., contemporary,
intuitive concepts and usage). For this analysis, information is
intuitively treated as existing in graphic records* (e.g., documents)
and to be perceivable by an inquiring mind which has a need for
information.

Contemporary society can be viewed as an enormous information
generating, processing, storage and retrieval mechanism. The problem
of overabundance of information is compounded by a seemingly cultural
magpie-like behavior which seeks to store the better part of all
information and to retrieve it as well (29). There is no accurate
census of the literature population, but a number of statistical
estimations have been made. De Solla Price (41) has estimated that
350,000 scientific papers are published annually. Bourne (12, 13) has
estimated that there are 30,000 to 35,000 journals published annually,
of which 15,000 are significant,** and that the volume of significant
papers published throughout the world per year is between 900,000 and
2,100,000. Further, there are an estimated 3500 abstracting and
indexing services in the world (circa 1960). In addition to the
periodical population there are the monograph and abstract files.
Figure 1.1 illustrates the estimation of the file size of books and
periodicals in U.S. colleges and universities, and Fig. 1.2 the file
size for U.S. public libraries. An estimate of the number of technical
literature abstracts and/or citations produced annually throughout the
world is given in Fig. 1.3.

* The specification of graphic records is for the purposes of this
analysis, and is not meant to imply that written/printed language is
the only source of information. Other media, often less restrictive,
are the non-graphic verbal and non-verbal.

** No definition of significance is given.

While the per annum volume of periodicals, abstracts and monographs
is impressive, the estimated growth rates are staggering. De Solla
Price (42, 43) has plotted (see Fig. 1.4) the growth of scientific and
abstract journals published from the oldest surviving periodical* to
the year 2000, and an exponential growth is clearly evident. Holt (67)
surveyed the growth of the professional literature in economics,
electrical engineering, physics, psychology and biology, and also
observed exponential growth characteristics; his results are shown in
Fig. 1.5. Holt (67), Brookes (19), and Krauze (79) all suggest that
the growth of literature in terms of the number of articles and
journals is of the form:

    V_t = V_{t_0} e^{r(t - t_0)}

where V_t     = total volume of literature (in the field of interest)
                at time t

      V_{t_0} = volume of literature at time t_0

      r       = the growth rate (estimated to result in a doubling
                every 10 years). Note: e^{10r} = 2 implies r ≈ 7
                percent per annum

* Philosophical Transactions of the Royal Society of London (1665).



[Figure 1.1. U.S. college and university libraries: file size and
accession rates. Axes: number of file additions per week; cumulative
percent of total files. Source: Library Statistics of Colleges and
Universities, Part 1: Institutional Data, U.S. Dept. of Health,
Education and Welfare, Office of Education, J. C. Rather and D. C.
Holladay, Report OE-15033 (1961). (13)]

[Figure 1.2. U.S. public library systems: file size and accession
rates. Axes: number of file additions per week; cumulative percent of
total files. Sources: Statistics of Public Library Systems in Cities
with Populations of 100,000 or More: Fiscal Year 1958, U.S. Dept. of
Health, Education and Welfare, Office of Education, Circular 590 (June
1959); Statistics of Public Library Systems in Cities with Populations
of 50,000 to 99,999: Fiscal Year 1958, U.S. Dept. of Health, Education
and Welfare, Office of Education, Circular 594 (July 1959); Public
Library Statistics 1944-45 (for cities with populations of 25,000 to
49,999), Federal Security Agency, Office of Education (1947). (13)]

[Figure 1.3. Accumulated sizes and current accession rates of the
publications of several abstracting and indexing services. Axis:
number of file additions per week. Note: The file size includes all
abstracts or citations published by each service from its inception
through the year 1960. (13)]

[Figure 1.5. Periodical Growth (67). Horizontal axis: time.]



      e_t     = the statistical error of measurement; assumed to have
                the property Expected Value (e_t) = 0

      D*      = the death rate of journals, and the death rate or
                near-usefulness of articles*

The above analysis, admittedly cursory and non-rigorous, does imply an
information control, storage and retrieval problem. All evidence seems
to say that for any established field there is an abundance of
information, and it is growing.
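
The doubling-time arithmetic behind the quoted growth rate can be
checked with a short sketch. It assumes a purely exponential model with
continuous compounding; the function names and the starting volume are
illustrative choices, not taken from the report.

```python
import math

def growth_rate_from_doubling(doubling_years: float) -> float:
    """Continuous growth rate r such that exp(r * doubling_years) == 2."""
    return math.log(2.0) / doubling_years

def projected_volume(v0: float, r: float, years: float) -> float:
    """Volume after `years` of exponential growth from a starting volume v0."""
    return v0 * math.exp(r * years)

# A doubling every 10 years gives r of about 0.069, i.e. roughly
# 7 percent per annum.
r = growth_rate_from_doubling(10.0)
print(f"r = {r:.4f} per annum")

# Hypothetical starting volume, chosen only for illustration.
v0 = 350_000.0
for years in (0, 10, 20, 30):
    print(years, round(projected_volume(v0, r, years)))
```

Under these assumptions the volume doubles at each 10-year step, which
is the sense in which the growth rates are described as staggering.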

There is a bona fide need to store a substantial portion of existing
literature, and there is a need for a physically feasible means of
retrieving information that is both economically practical, and time
and content relevant to the information user. Concern about this
information handling problem has placed new emphasis on the
traditional activities of assembling and coding recorded information,
and has resulted in the emergence of a new discipline, Information
Science, which focuses on the analysis and solution of information
storage and retrieval (ISR) problems. A variety of systems, processes
and techniques has been constructed to cope with many ISR problems,
and a typical set of ISR processes and their interactions is
illustrated in Fig. 1.6.

An important subset of ISR systems are document retrieval systems
(DRS), which, as the title implies, retrieve documents and hence the
information in them indirectly. This subset of systems, for instance,
excludes fact retrieval or intelligence retrieval systems. The term
document is used as a generic for information-bearing items:
monographs/books, periodical articles, abstracts, film, machine
coded/readable tape, etc. It is with this particular class of ISR
systems that this report is concerned.

* A great deal of conjecture surrounds the assessment of D*. It
is believed (Brookes (19)) that it has exponential properties, but
these are very relative to the user and subject in question.

[Figure 1.6. Information Storage and Retrieval Systems]

To date, extensive research has been carried out on various aspects
of DRSs. Ostensibly, major efforts* have been made in index analyses
and evaluation by Cleverdon (30, 31, 32), Taube (135, 136), Gull (61),
Thorne (138) and Swanson (128); user satisfaction by Borko (11),
Bourne (14, 15), Fairthorne (47), Goffman (56), Rees (112, 113) and
Swets (131, 132); retrieval output relevancy by Barhydt (5), Cuadra
(36, 37, 38), Doyle (45), Goffman (57), Lancaster (82), and Salton
(117, 118, 119); and automatic classification by Litofsky (90).
Notwithstanding these and other efforts, more problems remain unsolved
than solved in the design, management and evaluation of DRSs. Of
particular interest is the class of problems concerning the estimation
of the retrieval quantity of DRSs. This particular dimension of DRS
performance has not been thoroughly analyzed, and no satisfactory
operational solution has been suggested.

The basic objective of this research is to develop a methodology
that will enable designers and managers of DRSs to estimate the
quantity output in response to an inquiry, prior to the processing of
the inquiry. A secondary objective is to demonstrate how the
estimation methodology can be used to assess DRS changes over time.

* No attempt is made to be exhaustive; the cited work is intended
to be a representative sample of previous efforts by some of the more
well-known researchers in Information Science.



Before proceeding with the derivation of the retrieval quantity (Rq)
measure, the context and qualifications of the analysis will be
presented. In the next chapter the specific class of DRSs to which the
Rq estimation procedure is to apply is described, and Chapter 3
presents a survey of the many DRS measures of performance to place the
Rq measure in perspective.

Chapter 4 presents a discussion of previous efforts to develop
an output quantity measure, and also contains a formal description of
the recommended methodology to develop the estimate. In Chapter 5,
a description of the experiments performed to evaluate the Rq
estimation procedure, and the results of the experiments, are
presented. Chapter 6 presents a discussion of various applications of
the Rq measure to aid in the management and design of DRSs. Also,
Appendix A contains a glossary of Information Storage and Retrieval
terminology.






A man should keep his little brain attic stocked with
all the furniture that he is likely to use, and the rest he
can put away in the lumber-room of his library, where he can
get it if he wants it.

Sherlock Holmes

Chapter 2 

COORDINATE INDEX DOCUMENT STORAGE AND RETRIEVAL SYSTEMS:
A FORMAL DESCRIPTION 

2.1 DOCUMENT STORAGE AND RETRIEVAL SYSTEMS 

Document Retrieval Systems (DRSs) are a class of information
retrieval systems solely concerned with the subject analysis of
document content, the storage of a set of official surrogates
"defining" document content, and the "mechanical" search of the
surrogate set to identify or select those documents most "relevant" to
a user's formal request. The basic functions of a DRS are illustrated
in Fig. 2.1.

Of special interest to this discussion are system output, user
inquiries, and index characteristics. Since each of these processes
and products is embedded in a system and is directly influenced by
other system components, a brief review of the major system functions
will be presented to place following developments in proper system
perspective.

2.2 DOCUMENT SELECTION: SIZING THE COLLECTION

Mention has already been made of the existing volume and growth of
documented information, and of the associated problems of researchers,
students, etc. concerned with keeping abreast of their fields of
interest.

It is elementary, however, to note that not all existing information
related to any one subject should be stored in DRSs serving users in
that field. It is equally evident that not all newly generated
documented literature is a contribution to that field, and uncurbed
storage of documents would result in unsatisfactory DRS performance.
From the point of view of the user, the quantity and quality of system
output would leave much to be desired. From the point of view of the
manager, the costs of indexing, analysis and searching would be out of
balance with the system's effectiveness. In order to manage a document
collection, selection criteria and document filtering are necessary.
Simply put, not all documents in a subject field should be input into
a DRS, and not all documents input into the DRS should be stored
forever.

With regard to the issue of document collection size, certain models
have been developed that can aid the DRS designer and manager to
estimate the number of documents or journals that should be reviewed
to yield a desired number of subject-relevant "documents," or
conversely to estimate the number of "documents" that are generated by
a certain number of journals. The two models have been referred to by
Leimkuhler (88) as the Bradford Law of Scattering and the Bradford Law
of Distribution, and as one might suspect are inversely related.
Bradford (16) first stated the relationship of "documents" to
journals, as follows:

    if a large collection of papers is ranked in order of
    decreasing productivity of papers relevant to a given topic,
    three zones can be marked off such that each zone produces
    one-third of the total of relevant papers. The first, the
    (sic) nuclear zone, contains a smaller number of highly
    productive journals, say n_1; the second zone contains a
    larger number of moderately productive journals, say n_2,
    and the outer zone a still larger number of journals of low
    productivity, say n_3. The Law of Scatter states that,

        n_1 : n_2 : n_3 = 1 : a : a^2

    where a is a constant.

In the subject of geophysics, which Bradford analyzed, "a" was
approximately equal to five.

Subsequent to Bradford's effort, Vickery (141), Kendall (75),
Leimkuhler (88), Fairthorne (49) and Brookes (20) have each made
contributions to the interpretation and operationality of the Bradford
Law of Scatter. Leimkuhler (88) has shown the inverse relationship
between the Law of Scatter (the distribution of the number of journals
containing a given fraction of relevant documents) and the Law of
Distribution (the distribution of document productivity in a
collection of journals) and has expressed the latter in the following
form:

    F(x) = \frac{\ln(1 + \beta x)}{\ln(1 + \beta)}

where F(x)  = the cumulative fraction of "documents" in a collection
              of journals on a specific subject
      x     = the corresponding fraction of the most productive
              journals in the collection; and 0 ≤ x ≤ 1
      \beta = a constant related to the subject field and
              the completeness of the journal collection.

The above model enables a DRS designer or manager to estimate the
relationship between the number of documents in the system corpus
and the number of documents in the population of journals on a
specific subject. In other words, the Bradford relationships can be
used* to relate the productivity of a collection of journals to the
population of journals, and aid in the selection of journals to yield
documents for the corpus. Given a subject field and budget
constraints, these relationships can aid in the cost/benefit tradeoff
between budget dollars and the number of documents/journals to
collect.
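
As a rough illustration of how these relationships might support
collection sizing, the sketch below evaluates the Leimkuhler form of
the Law of Distribution, F(x) = ln(1 + βx) / ln(1 + β), its inverse,
and the Bradford zone ratio 1 : a : a^2. The function names and the
value of β are assumptions made for the example; they are not values
from the report.

```python
import math

def leimkuhler_fraction(x: float, beta: float) -> float:
    """Cumulative fraction of relevant documents contained in the most
    productive fraction x of journals: ln(1 + beta*x) / ln(1 + beta)."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    return math.log1p(beta * x) / math.log1p(beta)

def journals_needed(target_fraction: float, beta: float) -> float:
    """Inverse of F: fraction of journals needed to reach a target
    fraction of the relevant documents."""
    return math.expm1(target_fraction * math.log1p(beta)) / beta

def bradford_zones(n1: float, a: float) -> tuple:
    """Journal counts in the three Bradford zones, n1 : n1*a : n1*a^2."""
    return (n1, n1 * a, n1 * a * a)

beta = 30.0  # hypothetical constant, for illustration only
for x in (0.1, 0.25, 0.5, 1.0):
    print(f"{x:.2f} of journals -> {leimkuhler_fraction(x, beta):.2f} of documents")

print(f"journals needed for half the documents: {journals_needed(0.5, beta):.3f}")
print(bradford_zones(8, 5))  # zone sizes for a = 5, as in Bradford's geophysics case
```

The inverse function is the form a manager would use when working from
a target coverage back to a journal budget.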

2.3 INDEXING - DOCUMENT ANALYSIS AND REPRESENTATION

For this discussion, indexing will be defined as the assignment
of subject content indicating terms to a document. The purpose of
the indexing operation is to make it possible to search a file of the
content indicating terms, which are mapped onto the set of documents,
as a substitute for searching the document set, and to identify those
documents relevant to an inquiry. Relevant is used here to mean
that condition in which the terms used in the inquiry are also used
to describe the selected documents.

It is of course theoretically possible to review the set of
documents as opposed to the index file, but this approach quickly
becomes physically and economically impractical for even moderate
collections (several hundred) of documents. Thus the index provides a
manageable set of content indicating terms and classes to be searched
in place of the corpus, and provides a vehicle to identify those
documents in the corpus most likely to contain the desired
information.
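
The idea of searching an index file in place of the corpus can be
sketched as follows. The documents, terms and function names are
hypothetical, and "relevant" is taken in the narrow sense defined
above, here assumed to mean that every inquiry term was also assigned
to the document.

```python
from collections import defaultdict

# Hypothetical corpus: document id -> index terms assigned by an indexer.
index_terms = {
    "D1": {"indexing", "retrieval"},
    "D2": {"retrieval", "thesaurus"},
    "D3": {"indexing", "thesaurus", "retrieval"},
}

def build_index_file(assignments):
    """Invert the document -> terms assignment into term -> documents."""
    index_file = defaultdict(set)
    for doc_id, terms in assignments.items():
        for term in terms:
            index_file[term].add(doc_id)
    return dict(index_file)

def relevant_documents(index_file, inquiry_terms):
    """Documents described by every term in the inquiry (term-match relevance)."""
    postings = [index_file.get(term, set()) for term in inquiry_terms]
    if not postings:
        return set()
    return set.intersection(*postings)

index_file = build_index_file(index_terms)
print(sorted(relevant_documents(index_file, ["indexing", "retrieval"])))
```

Only the index file is consulted during the search; the documents
themselves are never read, which is the economy the section describes.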

There is in fact a spectrum of indexing philosophies, and associated
techniques with various proper names. That they are all related or
relatable has been discussed by Artandi (3), Bourne (13), Jahoda (71),
and demonstrated by Foskett (52). Basic to any indexing process is the
set of vocabulary terms employed to describe the content of the
documents. The set of vocabulary terms constitutes the index language,
and as well, an important part of the inquiry language of DRSs. The
latter property follows from the fact that once the index terms have
been assigned to the set of documents, they are then used to represent
the documents and become the vehicle to map inquiries onto the corpus.

* Groos (60) has observed a departure from the linear relationship
in log-log space of the Bradford Law when plotting the Keenan-Atherton
data for physics. The observed deviation, however, has not been
thoroughly evaluated to determine if the cause lay in the assumptions
of the Bradford Law or in the incompleteness of the experimental
observations.

Traditionally, subject classification concepts involve the use of
formal schemes to organize the subject matter in a predetermined order
to some prescribed depth of detail. Typically, these traditional
classifications are hierarchical in nature; that is, there exists
among the set of descriptors a rather precisely defined relationship
of every term to every other term. At the other end of the spectrum
there are the "key word" systems, which in their simplest form have no
word relationships defined, and usage of, and addition to, the
descriptor vocabulary is unrestricted. Artandi (3) makes a useful
distinction between "systems vocabulary" and "lead-in vocabulary" as a
means of distinguishing between word indexing and subject indexing.
They are both methods of representing document content, but they
differ operationally. By "systems vocabulary" is meant the set of
terms under which document content descriptor entries are made; that
is, the terms used to index the documents. The "lead-in vocabulary" of
a DRS "is an index referring from terms used in the literature to
terms in the system vocabulary" (3). The principal characteristic of
word indexing is that descriptors or words are employed as they are
found in the text of documents to serve as index terms. Thus word
indices are derived from the documents that are being indexed.

The key word in context (KWIC) index is an example of word indexing,
in its simplest form involving elementary alphabetical permutations of
the "key words" in the document titles.
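
A minimal sketch of the permuted-title idea behind a KWIC index is
given below. The stop list, the " / " end-of-title marker and the
sample titles are assumptions for illustration, not a description of
any particular KWIC implementation.

```python
# Hypothetical stop list; production KWIC systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def kwic_index(titles):
    """Return (key word, rotated title) entries sorted alphabetically.

    Each title yields one entry per non-stop word; the title is rotated
    so that the key word comes first, with "/" marking the original end
    of the title.
    """
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            key = word.lower().strip(".,;:")
            if key in STOP_WORDS:
                continue
            rotated = " ".join(words[i:] + ["/"] + words[:i])
            entries.append((key, rotated))
    return sorted(entries)

titles = [
    "Management of Document Retrieval Systems",
    "Indexing and Retrieval",
]
for key, line in kwic_index(titles):
    print(f"{key:<12} {line}")
```

Because the entries come straight from the title words, the index is
derived from the documents themselves, which is the defining property
of word indexing noted above.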

2.3.1 Coordinate Indexes 

Word indices in which the index terms are manipulated or coordinated*
are called coordinate index systems. Further, those DRSs in which
the coordination of the descriptors is done in the indexing process
are called pre-coordinate DRSs. Analogously, those systems in which
the coordination of the descriptors takes place during the inquiry
generation process are called post-coordinate DRSs. The pre- and post-
distinction obviously refers to the time at which the descriptor terms
are combined.

The important characteristic of pre-coordinate DRSs is that searching
occurs, and inquiries are generated, using the terms and term
combinations the indexer has prepared. There is no additional
coordination of the descriptors at the time of the inquiry.

Traditional examples of pre-coordinate systems are the hierarchical
systems in which a tree structure is employed to define a
generic-subordinate relationship and the coordinated relationships
among the subordinate terms. Figure 2.2 illustrates a typical
hierarchical scheme; some examples are the Library of Congress, Dewey
Decimal, and Universal Decimal Classification Systems.

*As first developed, "coordinated terms" literally implied the
statistical conjunction of two or more terms. However, the meaning of
"coordinate index" as used in most post-coordinate index systems has
been broadened to incorporate the full set of Boolean operators, and
in some instances even syntactical, semantic and syndetic term
relationships.

Another class of pre-coordinated systems is the facet index. A
facet is a set of terms which occurs with sufficient frequency in a
subject field to provide a useful category, or facet, of terms for the
description of documents in that field. A schematic of a facet index
is given in Fig. 2.3. In these systems, the pre-coordination of the
descriptor terms occurs at the time the facet is defined. The concept
of faceted systems for subject description was first developed by
Ranganathan (110) in his colon classification scheme.

Although the above two classes of pre-coordinate index systems
exhibit strong structural properties, there are also pre-coordinate
systems which have no hierarchies or proper set structure. Such
systems essentially consist of a set of descriptors (the vocabulary),
and a set of indexing and vocabulary control rules.

Post-coordinate index schemes, as noted previously, are exemplified
by the combination of more or less elemental index terms at the time
of inquiry generation and search initiation. These systems are adaptive
in that they can accommodate shallow or deep indexing as well as simple
or complex inquiries. In their earliest form, post-coordinate
retrieval systems were known as Uniterm systems, after Taube (135). The
Uniterm is a unit or elemental concept, usually a single word, used
to describe the subject of a document. In many systems, the vocabulary
is derived from the text and title of the documents to be indexed, and
no control is applied over the vocabulary or the coordination of the
descriptor terms. The post-coordinate index system is a very versatile
scheme and can be adapted to incorporate a broad set of
characteristics. Figure 2.4 illustrates a taxonomy of coordinate
retrieval systems, and various logical extensions to other types of
index systems. Of central relevance to this report are the
post-coordinate Document Retrieval Systems that incorporate Boolean
operators in the system language.

[Fig. 2.2 (typical hierarchical scheme) and Fig. 2.3 (schematic of a
facet index) are not reproduced.]
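
The operational difference between pre- and post-coordination can be
made concrete with a small Python sketch. The two files, the document
labels and the function names below are hypothetical, chosen only to
illustrate the distinction drawn in the text: in the pre-coordinate
file the indexer has fixed the term combinations, while in the
post-coordinate (Uniterm style) file elemental terms are combined only
when the inquiry is posed.

```python
# Pre-coordinate file: term combinations are fixed at indexing time.
pre_coordinate_file = {
    ("information", "retrieval"): {"d1", "d3"},
    ("information", "storage"): {"d2"},
}

# Post-coordinate file: each elemental term is posted separately.
post_coordinate_file = {
    "information": {"d1", "d2", "d3"},
    "retrieval": {"d1", "d3"},
    "storage": {"d2", "d3"},
}


def pre_coordinate_search(heading):
    """Look up a heading exactly as the indexer prepared it."""
    return pre_coordinate_file.get(heading, set())


def post_coordinate_search(*terms):
    """Conjunctively coordinate elemental terms at inquiry time."""
    postings = [post_coordinate_file.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# A combination the indexer never prepared finds nothing in the
# pre-coordinate file, but can still be formed at inquiry time.
print(pre_coordinate_search(("retrieval", "storage")))
print(post_coordinate_search("retrieval", "storage"))
```

The sketch shows only conjunctive coordination; the fuller Boolean
grammar is taken up in Section 2.5.1.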

2.4 THE INDEX FILE 

The index file in a coordinate index system consists of the
descriptor/index vocabulary and the descriptor tracings or assignments
to the documents in the corpus. A sample of an actual index
vocabulary for the subject area of Information Science is given in
Fig. 2.5, and a sample of a term frequency-of-use ranking is presented
in Fig. 2.6.

Of particular interest are the following characteristics of a
coordinate index system file:

(1) the number of active terms in the vocabulary

(2) the frequency of use of each term

(3) the depth of indexing for the documents in the corpus

These characteristics are indicative of the term-document distribution
in the DRS, which is the basic relationship in these systems. It is
important to realize that all these characteristics are dynamic in
nature. They will change as new index terms are added, or created out
of combinations of existing terms, and as new documents are added
to, and old documents dropped from, the corpus. The index vocabulary
is used by the system user to generate, in a post-coordinate sense,
inquiries to the DRS.
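
As a rough illustration, the three characteristics listed above can be
computed directly from the descriptor tracings. The sketch below is in
Python; the tracings, the vocabulary and the function name
index_file_profile are hypothetical, and "depth of indexing" is taken
here to mean the number of terms assigned to each document, which is
an assumption about the intended definition.

```python
from collections import Counter

# Hypothetical descriptor tracings: document -> assigned index terms.
tracings = {
    "d1": {"retrieval", "indexing"},
    "d2": {"retrieval"},
    "d3": {"retrieval", "indexing", "thesaurus"},
}
# Hypothetical vocabulary; "anthology" has no documents indexed yet.
vocabulary = {"retrieval", "indexing", "thesaurus", "anthology"}


def index_file_profile(tracings, vocabulary):
    """Return (active terms, term frequency of use, depth of indexing)."""
    term_frequency = Counter(
        term for terms in tracings.values() for term in terms
    )
    active_terms = {term for term in vocabulary if term_frequency[term] > 0}
    depth_of_indexing = {doc: len(terms) for doc, terms in tracings.items()}
    return active_terms, term_frequency, depth_of_indexing


active, frequency, depth = index_file_profile(tracings, vocabulary)
print(len(active), dict(frequency), depth)

# The characteristics are dynamic: adding a document changes them.
tracings["d4"] = {"anthology", "retrieval"}
active_after, frequency_after, _ = index_file_profile(tracings, vocabulary)
print(len(active_after), frequency_after["retrieval"])
```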



[Fig. 2.4 - Coordinate index models. Figure not reproduced; legible
labels: Document Set; Indexing; Ordinal Values of Vector Elements;
Normalized Binary Vector Image.]



S  = SEE
SA = SEE ALSO
SN = IN THE SENSE OF (I.E. SCOPE NOTE)
*  = NO DOCUMENTS YET INDEXED WITH THIS TERM
♦  = TERM NOT ALLOWED; RELATED TERM TO BE USED

♦ABBREVIATION
ABSTRACT
ABSTRACTING
ACCESS
ACCESSION NUMBER
ACCURACY
ACQUISITION
ADDRESS
ADMINISTRATION
ALGEBRA
♦ALGOL
     S  PROG. LANGUAGE
ALGORITHM
ALPHABETIC
ALPHABETIC ORDER
ALPHANUMERIC
♦ALTERNATIVES
AMBIGUITY
ANALOGY
ANALYSIS
ANSWER
♦ANTHOLOGY
     SA BIBLIOGRAPHY
APPLICATION
♦ARITHMETIC
     S  MATHEMATICS
ARRAY
♦ARTICLE
     S  DOCUMENT
ARTIFICIAL INTEL
ASSIGNED
ASSOCIATION
ASSOCIATIVE
♦ATTRIBUTE
     S  CHARACTERISTIC
AUTHOR
AUTHORITY LIST
     SA THESAURUS
AUTO ABSTRACTING
AUTO. INDEXING
AUTOMATIC
AUTOMATION
     SA MECHANIZATION
BATCH PROCESSING
BIBLIOGRAPHIC
BIBLIOGRAPHY
     SA ANTHOLOGY
BINARY
BOOK
BOOLEAN
     SA LOGICAL
CALL NUMBER
CANONICAL
     SA NORMALIZED
CARD
CARD CATALOG
CATALOG
CATALOGING
CATEGORIES
CENTERS
CENTRALIZED
CHARACTERISTIC

Fig. 2.5 - Index vocabulary illustrations (from Maron (98))



INDEX TERM          NO. OF REFS.    INDEX TERM          NO. OF REFS.

INFO. RETRIEVAL          84         SEARCH STRATEGY          22
SYSTEM                   84         SYMBOL                   22
DOCUMENT                 78         TECHNICAL                22
COMPUTER                 69         AUTO. INDEXING           21
STORAGE                  69         BIBLIOGRAPHIC            21
INDEXING                 64         SCIENTIFIC               21
RETRIEVAL                63         STAT. METHOD             21
INFORMATION              59         CONCEPT                  20
SEARCHING                58         EFFICIENCY               20
ANALYSIS                 53         RECALL                   20
CLASSIFICATION           52         TEXT                     20
STRUCTURE                52         THEORY                   20
INDEX                    49         ABSTRACT                 19
RELEVANCE                49         CO-OCCURRENCE            19
LANGUAGE                 46         CODING                   19
EVALUATION               44         KEYWORD                  19
EXPERIMENT               44         TRANSFORMATION           19
ASSOCIATION              42         WEIGHT                   19
SEMANTIC                 41         GRAPH                    18
MATRIX                   39         VOCABULARY               18
NATURAL LANGUAGE         38         CLUMP                    17
[illegible]              36         HARDWARE                 17
FREQUENCY                35         MODEL                    17
DESCRIPTOR               34         SUBJECT                  17
QUESTION                 33         SYNONYM                  17
DICTIONARY               32         SYNTACTIC ANAL.          17
PROGRAM                  32         TREE                     17
USER                     32         COMPARISON               16
DATA                     31         COORDINATE INDEX         16
MEASURE                  31         CORRELATION              16
TRANSLATION              31         MECHANIZATION            16
LIBRARY                  30         TAG                      16
RELATIONSHIP             30         TEST                     16
THESAURUS                30         ACCESS                   15
HIERARCHY                29         BIBLIOGRAPHY             15
ALGORITHM                28         CLASSIF. SCHEME          15
AUTOMATIC                28         CONTENT                  15
COMMUNICATION            28         COST                     15
INPUT                    28         EDUCATION                15
LINGUISTIC               28         LATTICE                  15
STATISTICAL              28         LINK                     15
SYNTAX                   28         MATHEMATICAL             15
PROBABILITY              27         RETRIEVAL SYSTEM         15
GRAMMAR                  26         TITLE                    15
OUTPUT                   26         ASSOCIATIVE              14
QUESTION-ANSWER          26         MEANING                  14
REFERENCE                26         NETWORK                  14
WORD ASSOCIATION         25         RESEARCH                 14
LITERATURE               24         SCANNING                 14
FILE                     22         SERVICE                  14
LOGIC                    22         ABSTRACTING              13
MATCH                    22         BOOLEAN                  13
PROCESSING               22         CITATION INDEX           13
RELEVANT                 22

Fig. 2.6 - Index term list sorted on frequency of use (from
Maron (98))



2.5 INQUIRY FORMULATIONS

The fundamental components of inquiry formulations are the user's
need for information, the system inquiry vocabulary (the index file),
and the system inquiry grammar.

The notion of user need for information is principally psychological
in nature; it is very dynamic and directly dependent on the relative
state of knowledge of the user. The reason for noting the user's need
at this point is primarily to identify the source of the DRS workload
or demand. The expression of a need for information, in the terms and
grammatical structure of the system, is the system inquiry. It is
usually the case that the formal inquiry is only a partially accurate
representation of the "real" need on the part of the user. However,
for the purposes of this analysis the formal inquiry will be taken as
the complete system workload, as the system output variable of interest
is quantity. The knotty issues of distinguishing between felt need,
expressed request and formal inquiry, and their respective "noise"
contributions to the relevance* and nonrelevance of system output, are
not dealt with.

The fundamental components of the formal inquiry are the descriptor
terms incorporated in the inquiry, and the grammatical operators used
to "coordinate" the terms. The descriptor terms have been described,
and the grammar used in DRSs will be discussed next.

*There has been more analysis related to the concept of relevance
(its definitions, measurement and quantification) than any other
Information Retrieval System characteristic. To mention just a few,
see Cooper (33), Barhydt (5), Cuadra and Katter (36, 37, 38), Doyle
(45), Salton (112), Swets (131), Swanson (129), and Maron and Kuhns
(97).



2.5.1 Inquiry Grammar

The operational manner in which the descriptor terms are coordinated
in an inquiry is defined by the system grammar. Of the class of
Information Storage and Retrieval systems that this analysis deals
with, the nature of the grammar is quite primitive; only certain
explicit operations/connections are permitted, between system
controlled vocabulary terms, in the "coordination" process.

The formal representation of coordinate retrieval system grammars
can take several forms. A common representation is in terms of a
logical language, for example, a sentential or propositional calculus.
In this analysis, the rules of term combinations can be formally
represented by a Lattice Algebra,* or its less general proper subset,
Boolean Algebra.** In the rest of the discussion a Boolean Algebra
structure will be assumed. Essentially, the specification of the
relationships between two classes of objects is what Boolean Algebra
is all about. Very briefly, this structure, for a defined set T and
its elements (A, B, ...), is defined in terms of the following
operations.

Conjunction:  C = A.B, the subset or subclass of all index
              terms or elements of T that are in both
              subset A and subset B.

*Excellent presentations of Lattice theory are provided in Birkhoff
(9) and Szasz (134). Applications to DRS theory can be found in Becker
and Hayes (6) and Salton (117).

**A Boolean Algebra is defined as a distributive lattice in which
each element "a" has a complement defined by its negation.



Disjunction:  D = A + B, the subset of all index terms or elements
              of T which are in either subset A or subset B.

Negation:*    N = -B (also written as B with an overbar), the subset
              of all index terms in T which are not in subset B.

Figure 2.7 illustrates many of the different symbolic and graphical
notations in use to represent the above logical operations. For this
analysis the notations "." for conjunction, "+" for disjunction and
"-" for negation will be employed consistently.

In sum, the inquiry language (grammar and vocabulary) is the 
vehicle to translate user's information needs into formal system in- 
quiries. Subsequent to the generation of the request, the next step 
is the search and retrieval process. 
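
The three operations of Section 2.5.1 correspond directly to set
intersection, union and complement. A minimal Python sketch follows;
the set T, its subsets A and B and the function names are hypothetical
examples, not taken from the report.

```python
# Hypothetical defined set T and two of its subsets.
T = frozenset({"t1", "t2", "t3", "t4", "t5"})
A = frozenset({"t1", "t2", "t3"})
B = frozenset({"t3", "t4"})


def conjunction(a, b):
    """C = A.B : elements that are in both subsets."""
    return a & b


def disjunction(a, b):
    """D = A + B : elements that are in either subset."""
    return a | b


def negation(b, universe=T):
    """N = -B : elements of the defined set that are not in B."""
    return universe - b


print(sorted(conjunction(A, B)))
print(sorted(disjunction(A, B)))
print(sorted(negation(B)))

# Each element has a complement: B + (-B) = T and B.(-B) is empty.
print(disjunction(B, negation(B)) == T, conjunction(B, negation(B)) == frozenset())
```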

2.6 SEARCH FILES AND RETRIEVAL PROCESS

A central DRS component is the storage or search file,** which
contains the descriptions of corpus documents. This file provides the
means whereby formal requests are compared with the index descriptions
of the documents. In a sense, there is an input indexing operation
(on the documents), and an output indexing operation (on the user's
request). Given that both requests and documents are represented by
lists of index terms, the retrieval process consists of matching the
two lists, and retrieving those documents whose descriptions
sufficiently overlap or match the inquiry.

*There are variations on the operation Negation that can be used
in DRSs; for example, Praternegation: implicit exclusion instead of
explicit exclusion, Soergel (125); Brouwerian Complement: the smallest
set of items that with certainty contains all the NEGATED elements,
Salton (117); Pseudo Complement: the largest set of items that with
certainty contains no NEGATED elements, Salton (117).

**In actuality, most DRSs have two search files; one for the
document descriptor images, and one for the physical storage of the
documents. Only the former files are of concern here.

[Fig. 2.7 - Equivalent logical operations and notation (from
Brandhorst (17)). Figure not reproduced; it tabulates the AND
(intersection, product, conjunction), OR (combination, sum, union,
disjunction) and NOT (negation, complement) operations with their
symbolic notations and 0/1 tabular representations.]

The assignment of index terms to documents can be represented in
matrix form. A hypothetical assignment of terms to documents is shown
in Figure 2.8. In this example, the index terms are represented by
the set T, and the set of documents by D, where T and D also denote
the power (number of elements) of the respective finite sets, which are
usually not equal. As indicated in the example, the term-to-document
assignment is a binary operation, represented by the blank or zero and
1 notation, the latter representing assignment. While other assignment
operations are possible, notably weighted assignments, the more common
index operations are binary, and will be the type assumed in this
analysis.

The search file can be represented in matrix (DXT) form, with the
columns constituting the index term profile of the corpus, and the rows
representing the membership of documents in the index term or concept
sets. There are two basic arrangements for the search file: index terms
on documents (TXD), the inverted file shown in Figure 2.9, and
documents on terms (DXT), as shown in Figure 2.8. The DXT arrangement
is the usual output from the indexing operation, and the TXD (the
transpose of DXT) is the more convenient form for searching and
retrieving documents. The retrieval process consists of a subject
search of the document descriptions. Several simple cases of subject
searches are illustrated in Figure 2.9. Search request (I) is a simple
one-descriptor inquiry, which would retrieve three documents. For this
kind of search request those documents that belong to the index sets
defined in the inquiry are retrieved, regardless of whether other index
descriptors are also assigned to the specific documents.

                        Set of Index Terms (T)

                          T1    T2    T3

                   D1      1     1     0
                   D2      0     1     0
    Set of         D3      1     1     1
    Documents (D)  D4      1     0     0
                   D5      0     1     1
                   D6      1     0     1

Fig. 2.8 - DXT matrix: assignment of terms to documents


            D1    D2    D3    D4    D5    D6

    T1       1     0     1     1     0     1
    T2       1     1     1     0     1     0
    T3       0     0     1     0     1     1

    Inquiry                          Output
    I.   T3                          D3, D5, D6
    II.  T2 and T3                   D3, D5
    III. T1 and (T2 or T3)           D1, D3, D6

Fig. 2.9 - Inverted TXD matrix, and sample inquiries
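
The relation between the two file arrangements, and the three sample
inquiries of Fig. 2.9, can be checked with a short Python sketch. The
matrix values are those of Figs. 2.8 and 2.9 as read from the scanned
figures; the variable and function names are hypothetical.

```python
# DXT matrix of Fig. 2.8: one row per document, one column per term.
DXT = [
    [1, 1, 0],  # D1
    [0, 1, 0],  # D2
    [1, 1, 1],  # D3
    [1, 0, 0],  # D4
    [0, 1, 1],  # D5
    [1, 0, 1],  # D6
]

# The inverted file (TXD) is the transpose: one row per term.
TXD = [list(row) for row in zip(*DXT)]


def posting(term_number):
    """Documents to which term T<term_number> is assigned."""
    row = TXD[term_number - 1]
    return {f"D{j}" for j, bit in enumerate(row, start=1) if bit}


inquiry_1 = posting(3)                               # I.   T3
inquiry_2 = posting(2) & posting(3)                  # II.  T2 and T3
inquiry_3 = posting(1) & (posting(2) | posting(3))   # III. T1 and (T2 or T3)

for name, output in (("I", inquiry_1), ("II", inquiry_2), ("III", inquiry_3)):
    print(name, sorted(output), "quantity =", len(output))
```

Because each term's row of the inverted file is read off directly, a
search touches only the rows named in the inquiry, which is one reason
the TXD form is the more convenient arrangement for retrieval.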

There are different retrieval strategies that can be used in
coordinate index DRSs to select inquiry "relevant" documents. The two
major strategies to be considered are direct match and word association
retrieval. The simplest direct match request is the single term
inquiry, already noted. The next and more common request is the
conjunctive coordination of two or more terms. These logical product
inquiries require that the documents retrieved have all the inquiry
terms assigned as subject descriptors, and the search result is defined
as an exclusive mapping on the search file. That is, only those
documents dealing with the inquiry "exclusively" are retrieved. Figure
2.10 illustrates an exclusive search by logical statements and Venn
diagrams.

A less restrictive direct match request is to disjunctively
coordinate descriptors as a logical sum. In this type of inquiry each
term is treated as a logical equivalent or synonym of every other term,
and any document description containing one or more terms is retrieved.
These logical sum inquiries result in an inclusive mapping on the
search file. For the same set of inquiry terms, the inclusive search
output will contain the exclusive search set. An illustration of an
inclusive search logic is given in Figure 2.10. In general, inquiries
will contain combinations of logical products and sums of index terms,
and occasionally, negation of a term. Term negation is treated in this
analysis as the complement of the logical product operation.
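
The containment noted above (the inclusive output contains the
exclusive set for the same inquiry terms) can be verified mechanically.
The following Python sketch uses the term postings of Fig. 2.9; the
function names exclusive_search and inclusive_search are hypothetical.

```python
from itertools import combinations

# Term postings as read from the inverted file of Fig. 2.9.
postings = {
    "T1": {"D1", "D3", "D4", "D6"},
    "T2": {"D1", "D2", "D3", "D5"},
    "T3": {"D3", "D5", "D6"},
}


def exclusive_search(terms):
    """Logical product: documents carrying all of the inquiry terms."""
    return set.intersection(*(postings[t] for t in terms))


def inclusive_search(terms):
    """Logical sum: documents carrying one or more of the inquiry terms."""
    return set.union(*(postings[t] for t in terms))


# Check the containment for every inquiry of two or more terms.
for size in (2, 3):
    for terms in combinations(sorted(postings), size):
        product = exclusive_search(terms)
        total = inclusive_search(terms)
        assert product <= total
        print(terms, len(product), "<=", len(total))
```

In terms of output quantity, the logical product of a set of terms can
never retrieve more documents than their logical sum.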

The second retrieval strategy is word association searching, in
which the initial inquiry is expanded or broadened so as to retrieve
more documents in the corpus that are "relevant" to the initial inquiry.
Association retrieval techniques are based on the relationships between
descriptor terms assigned to the DRS corpus. There are basically four
categories of word relationships that can be used as a basis for inquiry
term augmentation: (1) semantic relationships, which manifest the
meaning and context of terms within a language, (2) syntactic
relationships, which arise from terms as members of word classes and
with the class relationships in a structural (grammatical) sense,
(3) syndetic relationships, which measure the manner by which words
that are conjunctively coordinated with a given or base term
cross-reference one another, and (4) statistical relationships, which
measure the frequency of occurrence of terms in a document.

For this analysis, only the statistical association will be
discussed, in that it is the most common technique for inquiry
modification. The emphasis (in later chapters) will be on the
operational definition of such associations. As implied by the name,
statistical term association does not address the semantic, syntactic
or syndetic connections among terms; rather, it views terms as separate
isolatable units and is based principally on the frequency of term
usage within a given DRS corpus. The basic assumption is that, within
the context of a given corpus, terms which are statistically correlated
with one another are presumed to be meaningfully associated. Hence the
implication is that if terms A and B were determined to be associated,
for a given corpus, and term A appears in an inquiry, that inquiry
could be expanded by the disjunctive incorporation of term B with
term A. The objective of including term B is to increase the likelihood
of retrieving a larger set of inquiry "relevant" documents from the
corpus.



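
The expansion of an inquiry by statistical association can be sketched
in Python as follows. The text defers the operational definition of
association to later chapters, so the measure used here (the share of
documents two terms have in common, |A and B| / |A or B|), the 0.5
threshold, the postings and the function names are all assumptions
made purely for illustration.

```python
# Hypothetical term postings for a small corpus.
postings = {
    "retrieval": {"d1", "d2", "d3", "d4"},
    "search": {"d2", "d3", "d4", "d5"},
    "hardware": {"d6"},
}


def association(a, b):
    """Assumed association measure: shared documents / all documents
    carrying either term (not the report's own definition)."""
    union = postings[a] | postings[b]
    return len(postings[a] & postings[b]) / len(union) if union else 0.0


def expand(term, threshold=0.5):
    """Add every term whose association with `term` meets the threshold."""
    associated = sorted(
        other for other in postings
        if other != term and association(term, other) >= threshold
    )
    return [term] + associated


def retrieve(terms):
    """Disjunctive (logical sum) retrieval over the expanded term list."""
    return set().union(*(postings[t] for t in terms))


original = retrieve(["retrieval"])
expanded = retrieve(expand("retrieval"))
print(expand("retrieval"), len(original), "->", len(expanded))
```

Because the added terms are incorporated disjunctively, the expanded
inquiry can only retrieve as many or more documents than the original.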



2.7 DRS - A BRIEF FORMAL DESCRIPTION 

The above discussions have been basically informal, and it is
instructive to consider what a formal statement of a DRS consists of.
The advantages of a formal statement are: (1) that the elemental or
basic components of the system and their relationships are defined, so
as to provide a sound basis for intra-system analysis, and (2) that
inter-system structural and operational comparisons are facilitated.

Formally,* a coordinate index DRS is defined as consisting of:

1. A set of distinct documents to be analyzed/indexed

       $D = \{d_1, \ldots, d_D\}$

2. A set of elementary descriptors/attributes/index terms from
   which compound descriptors (combinations) can be constructed

       $T = \{t_1, \ldots, t_T\}$, the elemental set of attributes

       $T' = \{t_1, \ldots, t_{T'}\}$, the set of terms generic to set T
       and composed of combinations of elements in T

3. A set of statements/axioms which connect descriptors with
   documents. This set of statements defines a homomorphic mapping
   between the set of descriptors T and the set of documents D.
   The mapping usually results in a binary set of assignments,

       $\{T\} \Longleftrightarrow \{D\} : D \times T$ (binary)

   but it is not necessarily restricted to 0 or 1 assignments;
   weighted assignments are also possible.

4. A set of statements (theorems) derived from the axioms and
   the system grammar which define the manner of coordination
   and relationship of descriptors for searching and inquiry
   specifications.

*For extensive definitions of formal systems see Curry, et al. (39),
and for an excellent discussion of a formal system definition of DRSs
see Soergel (125).

2.8 RETRIEVAL SET CHARACTERISTICS 

It follows from the preceding discussions that the properties of
the retrieval set are a function of three parameters:

(1) the number of terms and the degree and type of coordination
    in the inquiry

(2) the search strategy, either direct match or word association

(3) the DRS DXT distribution, from which all the DRS
    characteristics can be derived.

The retrieval set characteristics are definable in terms of
quantity and quality. The quality measure is a reflection of the user's
judgment of the relevance of the retrieved material. The quantity
measure is simply the number of documents output in response to the
inquiry, and is the retrieval set characteristic of interest to this
discussion.

The principal task is to define the quantity output as a function
of the above noted parameters: inquiry, search strategy and the DXT
distribution. Various hypotheses about the functional relationship and
the parameters will be presented and analyzed in Chapters 4 and 5.
However, before addressing those issues, a statement of how retrieval
quantity is related to existing DRS performance measures is necessary
to provide additional perspective for the measure as a management and
design tool.






Performance measures like sign posts guide the way....

Chapter 3 

RETRIEVAL QUANTITY AND DRS PERFORMANCE MEASURES

3.1 INTRODUCTION 

In this section, the need for a Retrieval Quantity (Rq) measure
will be discussed, and the relationship of the proposed measure to
other DRS performance measures will be noted.

The tasks of design and management of DRSs require tools and
performance measures to aid in the selection of preferred candidate
options, and in the control over the fundamental processes of inquiry
analysis, indexing, retrieval and system output. The designer needs
tools that reflect the cause-effect relationships between the DRS
building blocks of thesaurus, corpus and term-document distribution.
Before a DRS is built, the design should be assessed and compared to
alternative designs. Existing DRSs require management tools to tune the
system to meet the needs of the user, and to control the changes in the
system due to growth in the thesaurus and corpus. Users of DRSs need
guidelines to construct and adjust inquiries to more completely meet
their information needs, both in quantity and quality.

Some of the tools and performance measures are available, and a
basis for an overall analytic framework also exists, although a rigorous
systems formulation has yet to be developed. A brief survey of a number
of the measures that can be used for design and management will be
presented next, and the Rq relationship to the different measures
briefly noted.



3.2 MEASURES FOR EVALUATION 

The primary purpose of a DRS is to provide, cost-effectively over
time, the system users with the information requested when it is
needed. The major dimensions of evaluation implied in this objective
statement include: time, cost, flexibility, convenience of use,
information quality, and information quantity.

3.2.1 Response Time 

In general, for information systems, the dimension of time reflects
the period to perform an operation such as providing the user with a
response to an inquiry.* Lowe (92) and Hayes (63) have investigated
various time processing distinctions between different file
organizations for storage and retrieval operations. Also, it follows
that the amount of time to process an inquiry will be proportional to
the thesaurus size and term frequency-of-use distribution. In fact,
Webster (145) has demonstrated that certain DRS dictionary searching
techniques are critically affected by the term frequency-of-use
distribution. In many DRSs the requests are batch processed, and from
the user's point of view the response time is fixed. However, the
amount of "processing" time is still of interest to the system manager.
In those systems in which there is an on-line real-time environment,
the user, by necessity, also becomes acutely aware of processing times.

One possible way to anticipate required inquiry processing time
is to use the inquiry as a basis for estimating the required search and
retrieval operations. The procedure for predicting the retrieval
quantity measure (to be presented in Chapter 4) entails a set of
iterative steps proportionate to the "complexity" of the inquiry.
Assuming a balanced file and dictionary look-up scheme in which each
step takes approximately the same amount of time to process, then by
estimating the retrieval quantity and keeping track of the number of
iterations required, the user and manager could gauge the inquiry
processing time and workload demands, respectively.

*A more restrictive definition of response time is offered by
Lancaster and Climenson (84), who define it as the average time required
to obtain a satisfactory response from the system.

3.2.2 System Costs 

Various recommendations have been made for measures of
cost-effectiveness for information storage and retrieval systems.
Overmeyer (105) has published a relatively detailed cost analysis of
the American Society of Metals System of Western Reserve University.
Lancaster (85) discusses relevant system factors susceptible to cost
analysis, and suggests possible tradeoffs between input and output
costs and between alternative candidate DRSs. Tell (137), Kochen (78),
Bryant (23), Westat (147) and Lancaster (84) have developed DRS
cost-analysis models of various degrees of detail. Notwithstanding
these efforts, a comprehensive operational model for costing still
remains to be developed. A sound basis for DRS cost analysis appears to
exist; for example, Lancaster (85) provides a subject relevant
framework that could be coupled with the concept of opportunity costs
and a well-developed system analysis setting, as in Fisher (51). It
appears, however, that standard cost accounting methods cannot be
conveniently or correctly carried over to DRS operations. As Marron
(99) notes, a corpus of documents is not really like or analogous to
equipment or machinery, particularly with regard to the concept of
depreciation or amortization. Also, the costs and effort of
constructing a corpus are not very sensitive to the demand volume for
services. As well, the problem of correctly tracing input and
operational costs is particularly difficult when there are several
information services performed by the system; for example,
dissemination, retrieval, abstracting, etc. Also, most DRSs operate in
a non-market setting in which the users of the system do not "pay" for
the service, and the system does not "compete" to provide the service.
This situation tends to complicate the costing of resources consumed
and the estimation of benefits accrued.

To some degree, the retrieval quantity estimate can aid in the
costing of inquiries: the inquiry processing time estimate noted above
is multiplied by a cost per unit of processing time. Also, the cost
estimation per inquiry can help the user "balance" his needs with the
probable system accrued costs.
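
The arithmetic implied here and in Section 3.2.1 is simple enough to
state as a Python sketch. It assumes, as the text does, that every
iterative step takes about the same time; the function names, the step
count and the unit figures are hypothetical values chosen only to show
the calculation.

```python
def processing_time(n_steps, seconds_per_step):
    """Estimated inquiry processing time, assuming every iterative
    step takes approximately the same amount of time."""
    return n_steps * seconds_per_step


def inquiry_cost(n_steps, seconds_per_step, cost_per_second):
    """Estimated cost: processing time multiplied by a cost per unit
    of processing time."""
    return processing_time(n_steps, seconds_per_step) * cost_per_second


# Hypothetical figures: a 5-step inquiry, 2 seconds per step,
# and 0.25 cost units per second of processing.
steps, step_time, unit_cost = 5, 2.0, 0.25
print(processing_time(steps, step_time))
print(inquiry_cost(steps, step_time, unit_cost))
```

How the number of steps follows from the "complexity" of an inquiry is
part of the estimation procedure of Chapter 4 and is not modeled here.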

3.2.3 System Convenience of Use 

The principal issue in the dimension of convenience of use is the
amount of effort that is required from the system user to interact with
the DRS. To some degree the literature on man-machine interaction has
some bearing. Certainly the notion of unburdening is relevant.
Investigations by Saracevic (120), Lancaster (82, 83), and Lesk and
Salton (86) indicate that there is a need for user-search analyst
interaction, but there is no consensus as to whether the interaction
should take place before the search or after the retrieval. There is no
convenience-of-use measure of what is efficient user-system
interaction. Clearly, a fundamental parameter is the state of the
user's need for information. Martyn and Vickery (100) discuss a number
of conditions affecting user need, and Voigt (142) has prepared an
early (1959) but still accurate description of user needs for
information. It would seem that given a communicable information need,
a retrieval quantity estimation process can aid in tuning the user's
inquiry to the expected/desired size of the response. This notion is
discussed further in Chapter 6.

3.2.4 System Flexibility 

Flexibility is meant to be a measure of the DRS's capacity for
positive adaptation. An implemented DRS can only stay successfully
operational if it is adaptive. Of interest for this measure are what
systems have to be adaptive for, and in what ways this flexibility can
be built into the system structure. Ironically, most DRSs are
justified on the basis of the rapid growth and rate of change of
relevant literature, and yet the systems are designed for the point in
time when they are implemented, with little regard given to the need
for flexibility to accommodate system growth. In addition to the
inherent growth of the corpus and thesaurus, DRSs should also have a
certain flexibility to adapt to changing user needs and behavior. One
of the greatest faults of the traditional library classification
schemes is the implicit assumption that all library users are
counterpart mini-models of the classification scheme, and as well will
never change. A preferable state is one in which a DRS would interact
with users at different levels of user proficiency, and grow in a
controlled sense with the incorporation of new material.



An important impact of growth is that as the corpus and thesaurus
change, the system output will be different at different points in
time for the same inquiry. The retrieval quantity measure can be used
to gauge the impact of corpus and thesaurus growth on the DRS output
quantity, and in this dimension provides a measure of system
adaptability. This application of the retrieval quantity estimate is
discussed in Chapter 6.

3.2.5 Retrieval Quality 

Measures of retrieval quality have by far received the most
attention of the DRS dimensions of evaluation. By retrieval quality is
meant the relevance, pertinence or correctness of the retrieved
document information to the user's information need.

For any document corpus, only a fraction of the collection will
contain relevant information regarding a specific user inquiry. For
example, if there are D documents in the corpus, then only R may be
relevant to the particular inquiry. Without the entire set D being
retrieved, it is unlikely that all R relevant documents will be retrieved
in any one search initiated by the inquiry. Usually, only a fraction
H of the R relevant documents are retrieved, and by definition M = R-H
will be missed. Also it is usually the case that a number I of
irrelevant documents will be retrieved by the system in response to the
inquiry. Following Vickery (140) these characteristics of a DRS and the
retrieval set are represented in a two-by-two contingency table as
shown in Fig. 3.1. For this binary construction, all the D documents
in the system are accounted for, with respect to the inquiry which
generated the retrieval set. Namely,








                     Relevant            Not Relevant

    Retrieved        a (Good Hits)       b (Bad Hits)        a + b = H

    Not Retrieved    c (Bad Misses)      d (Good Misses)     c + d = M

                     a + c = R           b + d = I           D

Fig. 3.1 -- Two-by-two contingency table of an inquiry response (140)






    a documents are good hits because a ⊂ R and a ⊂ H
    c documents are bad misses because c ⊂ R and c ⊄ H

presuming, of course, in this simple system that it is desirable to
retrieve all R relevant documents. Also included in the retrieved set H
are:

    b documents which are bad hits because b ⊂ I and b ⊂ H

and the remaining,

    d documents are good misses because d ⊂ I and d ⊄ H.

From the two-by-two contingency table in Fig. 3.1, a plethora of 
retrieval efficiency measures, primarily directed at assessing rele- 
vance/quality, have been derived. Table 3.1 lists a sample of the 
derivable measures. Fundamental to all of these measures are two vari- 
ables — a relevance judgment and the quantity of documents (relevant 
and/or irrelevant) output. The close relationship between output in- 
formation quality and quantity in these measures is clearly evident. 
A predominant characteristic of these measures is that they are all 
designed to be computed after the retrieval operation, and consequently 
are of limited use to predict output or the effect of a system change. 
The retrieval quantity estimate is a step in the direction of develop- 
ing management tools for predicting retrieval output and impacts due 
to system change. 

A review of previous attempts to construct a Retrieval Quantity
estimate, and the suggested methodology to predict Rq developed in this
analysis, are presented in the next chapter.






Table 3.1
RETRIEVAL SET MEASURES

    Measure                          Equation (based on Figure 3.1)

    Resolution factor (106)          (a+b)/D
    Elimination factor (106)         (c+d)/D
    Pertinency factor (106)          a/H
      (Relevance measure)
    Noise factor (106)               b/H
    Recall factor (106)              a/R
    Omission factor (106)            c/R   (Type I error)
    Generality ratio (31, 32)        R/D
      Concentration ratio (47)
    Fall out (69)                    b/I
    Specificity (113)                d/I
    Distillation factor (47)         (ad-bc)/((a+b)(c+d))
    Discrimination factor (47)       (ad-bc)/((a+c)(b+d))
    False acceptance (101)           [illegible in source]   (Type II error)
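
As a minimal sketch of how several of the Table 3.1 measures follow from
the four cells of Fig. 3.1, the fragment below computes them from counts
a, b, c and d. The function name and the sample counts are hypothetical
and are not taken from the report; only measures whose equations are
legible in the table are included.

```python
def retrieval_set_measures(a, b, c, d):
    """Compute selected Table 3.1 measures from the Fig. 3.1 cell counts.

    a = relevant and retrieved, b = not relevant and retrieved,
    c = relevant and not retrieved, d = not relevant and not retrieved.
    All four marginals are assumed to be non-zero.
    """
    h = a + b          # retrieved
    m = c + d          # not retrieved
    r = a + c          # relevant
    i = b + d          # not relevant
    total = h + m      # D, the whole corpus
    return {
        "resolution": h / total,
        "elimination": m / total,
        "pertinency": a / h,
        "recall": a / r,
        "fall_out": b / i,
        "specificity": d / i,
        "distillation": (a * d - b * c) / (h * m),
        "discrimination": (a * d - b * c) / (r * i),
    }


# Hypothetical counts for a 300-document corpus.
measures = retrieval_set_measures(a=8, b=12, c=4, d=276)
```

Every value here can only be computed once the relevance judgments and
the retrieved set are known, which is the limitation the text points out.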






Chapter 4 

RETRIEVAL QUANTITY ESTIMATION: LITERATURE REVIEW 
AND PROPOSED METHODOLOGY 

4.1 INTRODUCTION 

The main body of this chapter is concerned with a review of past 
work related to retrieval quantity estimation. The second part of this 
chapter describes the proposed methodology for prediction of output 
quantity. 

4.2 GENERAL CRITIQUE OF PREVIOUS RESEARCH 

Surprisingly, there have not been many analyses of the output
quantity of DRSs; the review that follows is quite exhaustive. Though
various approaches have been employed, all the research to date on the
determination of retrieval quantity has, either implicitly or explicitly,
been based on the assumption that index terms are used as though they
are independent of one another. The general lack of qualification or
modification of this assumption has been the rather pervasive Achilles'
heel of the efforts to date. This is so because index terms do not
occur or co-occur as though they are independent of one another. To
assume that they do exhibit independence causes large divergences
between actual and "theoretical" values of term co-occurrence and output
quantity.

The earliest attempt to estimate retrieval quantity appears to have
been by Bernier (7), in which the following argument is made. For a
system of D documents, T descriptors, a uniform depth of indexing of t
descriptors per document, with no two documents possessing an
identical set of descriptors and indexing being an "essentially-random"
assignment process, an n-term conjunctively coordinated inquiry has the
following probability of retrieving at least one document:

    \[ P(R_q \geq 1) = D \left( \frac{t}{T} \right)^{n} \]

This model is quite "hypothetic" due to its very restrictive
assumptions, which limit its usefulness. First, terms are not assigned as
though they are balls being selected randomly from an urn; secondly,
the depth of indexing distribution of DRSs is anything but uniform;
and thirdly, for systems of even moderate size the above probabilities
are so small as to provide almost no insight into the retrieval process.
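
The third objection can be illustrated numerically. The sketch below
evaluates the expression D(t/T)^n as reconstructed above; the function
name is hypothetical, and the sample values are simply the Stage 1
figures reported later in Table 5.1, used here only as convenient
inputs.

```python
def bernier_estimate(num_docs, thesaurus_size, depth, n_terms):
    """Return D * (t / T) ** n for an n-term conjunctive inquiry.

    num_docs = D, thesaurus_size = T, depth = t (uniform depth of
    indexing), n_terms = n. Assumes random, uniform term assignment.
    """
    if thesaurus_size <= 0:
        raise ValueError("thesaurus_size must be positive")
    return num_docs * (depth / thesaurus_size) ** n_terms


# Values fall off quickly as terms are added to the inquiry.
estimates = [bernier_estimate(300, 368, 14, n) for n in (1, 2, 3)]
```

Under these assumptions the value for a three-term inquiry is already
below 0.02, which is consistent with the remark that the model gives
little insight for realistic inquiries.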

A more ambitious attempt was made by A. D. Little (1, 2), in which
a model to predict the average number of documents to be retrieved for
a given inquiry is constructed. The expected number of documents
retrieved is defined as a function of:

(1) the number of terms coordinated in the inquiry (only
    conjunctive inquiries were used) -- n

(2) the number of documents in the corpus -- D

(3) the average depth of indexing -- q

(4) the frequency of use distribution of index terms -- which
    is approximated by a geometric series and incorporated in
    the function by a factor (1-β)/2, β < 1

(5) the term usage distribution for users generating
    inquiries -- S

(6) the index term correlation for indexing documents
    (the assumption of independent term usage with a
    correction factor was employed) -- S

(7) the effect of a system "requestor" to aid in the specification
    of search inquiries (an implicit factor)

with the resulting function:

    [equation missing in source]

This model, though containing many system and inquiry
characteristics, does not perform very well at all, as shown in Fig. 4.1.
The assumption of term-term independence is the principal factor. Also,
the distribution assumed for the selection of inquiry terms, while
not an essential ingredient for determining retrieval quantity, is
not necessarily the same as the distribution of terms used to describe
documents.

A more abstract approach is suggested by Switzer (133), who
employed a term-term distance measure to estimate the elements of the
term correlation matrix (TXT). Switzer does not estimate the expected
number of documents to be retrieved for an inquiry, but does note that
once the term-term couplets are estimated, the logical extension to
evaluating term combinations in inquiries is possible. The principal
assumptions in this analysis are:

(1) the normalized co-occurrences are considered to be
    probabilities (a frequency interpretation of probability is implied)

(2) the term co-occurrences are hypergeometrically distributed

The proposed relationship for the value of the couplet of terms a and
b is:



for n > 2. 



    \[ P(N_{ab}) = \frac{\binom{N_a}{N_{ab}} \binom{D - N_a}{N_b - N_{ab}}}{\binom{D}{N_b}} \]

which is the hypergeometric distribution with parameters

    N_ab = the number of co-occurrences for terms a and b
    D    = the number of documents in the corpus
    N_a  = the number of times term a has been used
    N_b  = the number of times term b has been used

Switzer did not empirically test this relationship, but it is
clear from the fundamental assumption of hypergeometricity, which is
random sampling from a finite population without replacement, that it
is not correct. As noted previously, term-term co-occurrences do not
occur as though they are the result of a random sample.

Fig. 4.1 -- Comparison of actual number of documents retrieved
            with theoretical number based on assumption of
            term-term independency (from Ref. 2)
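
To make the random-sampling assumption concrete, the following sketch
evaluates the standard hypergeometric probability for the four
parameters listed above, together with its mean N_a N_b / D. The
function names and sample values are hypothetical; this is the textbook
distribution, not a reproduction of Switzer's computations.

```python
from math import comb


def cooccurrence_pmf(n_ab, n_a, n_b, num_docs):
    """Probability of exactly n_ab co-occurrences of terms a and b when
    term b's n_b postings are a random sample, without replacement,
    from num_docs documents of which n_a carry term a."""
    lowest = max(0, n_a + n_b - num_docs)
    highest = min(n_a, n_b)
    if n_ab < lowest or n_ab > highest:
        return 0.0
    ways = comb(n_a, n_ab) * comb(num_docs - n_a, n_b - n_ab)
    return ways / comb(num_docs, n_b)


def expected_cooccurrence(n_a, n_b, num_docs):
    """Mean of the hypergeometric distribution above."""
    return n_a * n_b / num_docs


# Hypothetical example: two terms used 12 and 9 times in 300 documents.
pmf = [cooccurrence_pmf(k, 12, 9, 300) for k in range(0, 10)]
mean = expected_cooccurrence(12, 9, 300)
```

With these inputs the expected number of co-occurrences is well below
one, so any pair of terms that is regularly assigned together will
depart sharply from the hypergeometric prediction.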

One of the more interesting formulations to estimate document
output is presented by Raver (111), in which the term frequency of use
distribution is approximated by a normalized log function. The explicit
distinction between a normalized and unnormalized term frequency of use
distribution is very useful. In addition, Raver notes that all Boolean
combinations of terms are reducible/definable by the "and" and "or"
operators with those terms.

The logarithmic relationship between the frequency of term use and
the term rank (in which the term with greatest use is given rank 1,
the next most used term rank 2, and so on) is of the form:



for a normalized distribution,

where N' = frequency of use of the most frequently used descriptor
           (normalized)
      T  = total number of active descriptor terms out of a
           thesaurus of size t
      r  = rank of the term; 0 ≤ r ≤ T, and is defined by

           r = 0 when f = N        for unnormalized distributions
                      f = N/f_min  for normalized distributions;
                                   N' = N/f_min

           r = T when f = 1        for normalized distributions
                      f = f_min    for unnormalized distributions

where f_min is the frequency of use of the least used term in the active
subset of the thesaurus.

Obviously, in those systems in which f_min = 1, the term frequency
of use distribution is automatically normalized. An illustration of
the normalized term frequency of use distribution is given in Fig. 4.2.

From the above relationship, Raver then shows that

(a) the average number of documents per descriptor is:
    [equation illegible in source]

(b) the average number of descriptors per document is:
    [equation illegible in source]

(c) the average number of documents to be retrieved for an n
    term (conjunctive) inquiry is:
    [equation illegible in source]

The last relationship given is also extended to disjunctive
combinations of terms by noting that, "for r total documents equal to the
sum of all 'or' terms, the expected number of different documents out of D
documents will be,"

    [equation illegible in source]

In going through the above derivations it becomes clear that terms
are assumed to be independently assigned, and that term co-occurrences
are independently distributed. Consequently the Raver estimations do
diverge greatly from actuality. However, this derivation is unique in
that the term frequency of use distribution is, albeit implicitly,
assumed to be of some standard form represented by a stable class of
functions -- in this case the log function. This notion, as well as
that of the need to explicitly normalize the term-frequency of use
versus rank distributions, is used in the proposed methodology in this
report.

A different perspective is taken by King and Bryant (23), who deal
with the issue of quantity output in the context of an overall system
evaluation scheme in which the relative frequency of indexing consistency
(aggregated over the thesaurus, indexers and corpus) at a point in time
is determinable. As such, the expected number of documents is simply
the number of documents in the file that are relevant to the inquiry.
That is, if a K term inquiry were submitted with the conjunctive
requirement that the retrieved documents be described by all K terms, then
the expected number of documents retrieved would be:

    [equation illegible in source]

where p1 = the portion of the corpus that "should" contain [...] of
           the K terms in the inquiry
      p2 = the relative frequency of a document being indexed when
           it should not be (a Type II error)
      p3 = the relative frequency of a document being indexed
           when it should be indexed by the inquiry terms
      D  = the number of documents in the corpus.

The above analysis is basically dependent on the assumption of
independence of indexing errors. That is to say, if the assignment of
terms to documents is sufficiently consistent, a norm can be observed
about which statistical fluctuations will sum to zero -- if the
indexing errors are indeed independently distributed. A second assumption
is that there exists the ability to determine the fraction of the
corpus that should contain (be indexed by) the terms in the request.
Neither of these assumptions seems operationally practical, and together
they form a rather awkward basis for determining Rq. Clearly, one of the
desirable attributes of an operational estimation process is that it
not require unwieldy computations, or data not readily available.

Another somewhat different scheme, which indirectly addresses the
issue of quantity output, is investigated by Shumway (113). This
procedure involves an estimate of the total number of relevant documents
in a corpus, through the use of sampling techniques common to probit
analysis, and then estimating, with appropriate confidence intervals,
the number of documents necessary to output in order to retrieve a
certain specified quantity of the relevant documents in the corpus.

The estimation process entails taking an initial sample for which
the recall ratio (see Fig. 3.1) is determined. Then a second sample
(of the same size) is taken, and based on the overlap of common
"relevant" documents an estimate of the total set of relevant documents in
the corpus is made. This technique involves the use of the
hypergeometric distribution, and requires that the samples be random.* The
result of the sample sequence is used to construct a search
characteristic curve which measures or reflects the number of documents
needed to be retrieved in order to get a certain number of relevant
documents.

Wiederkehr (148) also utilizes the search characteristic curve
to estimate quantity output, and presents the interesting notion that
any search strategy has an equivalent series of single stage random
searches to generate the desired number of relevant documents in the
corpus. The notion of defining a search inquiry as a multiple of single
stage random searches is very useful, and will be incorporated in the
proposed methodology discussed in the next section of this chapter.

The usefulness of a search characteristic curve is limited by the
requirements for data sampling, and the consistency of judgment of what
is or is not relevant to an arbitrary inquiry. Also, the distribution
characteristics essential to the probit/hypergeometric analysis are not
sufficiently satisfied by a DRS.

* For a more complete discussion of this procedure see Feller (50).
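
The two-sample overlap idea can be sketched with the classical
capture-recapture (Lincoln-Petersen) estimator, which rests on the same
hypergeometric, random-sampling assumption. This is a stand-in chosen
for illustration and is not claimed to be Shumway's exact procedure; the
function name and the sample counts are hypothetical.

```python
def estimate_total_relevant(relevant_in_first, relevant_in_second,
                            relevant_in_both):
    """Estimate the total number of relevant documents in the corpus
    from two random samples and their overlap.

    If the second sample is random, the share of its relevant documents
    that also appeared in the first sample approximates the share of
    all relevant documents that the first sample captured.
    """
    if relevant_in_both <= 0:
        raise ValueError("the samples must share at least one relevant document")
    return relevant_in_first * relevant_in_second / relevant_in_both


# Hypothetical counts: 20 and 18 relevant documents found, 6 in common.
total_relevant = estimate_total_relevant(20, 18, 6)
```

The estimate is only as good as the randomness of the samples and the
consistency of the relevance judgments, the two limitations noted above.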






In summary, the principal efforts to date have not developed an
operational procedure for estimating Rq that could be useful to a
manager or designer of a DRS. The common assumption of random
assignment of descriptors to documents, or its equivalent term-term
independency assumption, is not satisfied by actual DRSs. In addition, those
procedures that could, albeit indirectly, lead to estimates of Rq
require an impractical amount of data and extensive relevance judgments.

4.3 PROPOSED METHODOLOGY FOR DEVELOPING THE Rq MEASURE

As noted, previous attempts to construct a retrieval quantity
measure have, in general, failed to correctly represent the
characteristics of DRS components, and also have not taken advantage of the
statistical regularity common to certain components of coordinate
indexed DRSs.

At the outset of developing an operational tool, it is advantageous
to indicate the desirable characteristics that the measure should possess.
Four such characteristics are:

1. The Rq measure should be defined in terms of the basic DRS
   components (or their equivalent distributions).

2. The measure should use data that is convenient to obtain in 
operational DRS settings, and easy to construct for those 
systems in the design stage. 

3. The value of the measure should be easy to compute. 

4. The measure should possess stability to allow (in the dimen- 
sion it measures) — (a) monitoring of intra-system changes, 
and (b) inter-system comparisons. 






As a preamble to the specifications of the measure, a brief
review of the basic DRS components, relationships and characteristics
will be given. Where a DRS characteristic or relationship is
recommended for incorporation in the measure, a hypothesis will be made
about the particular system property. In Chapter 5, the various
hypotheses stated in this chapter will be analyzed for acceptance or
rejection.

4.3.1 Fundamental DRS Relationships 

As noted in an earlier chapter, the basic DRS components are: 

(a) the system corpus -- D

(b) the system thesaurus -- T

(c) the term-document distribution -- DXT

The DXT distribution is the basis from which all other DRS charac- 
teristics are derived. For the class of DRSs of interest to this analy- 
sis the DXT matrix is binary, and a hypothetic example is given in 
Fig. 4.3. 

If one arrays the columns in the DXT matrix such that the term
with the greatest frequency of use is given rank 1, and the second most
frequently used term given rank 2, and so on, the resulting DXT matrix
can be represented by the term frequency of use distribution in Fig. 4.2.
Note that the most frequently used descriptor is assigned N_max as the
highest frequency of use, and the least used (>0) descriptor N_min.
When N_min = 1, the frequency-rank distribution is effectively
normalized. However, if N_min > 1, as is the case in certain truncated
distributions, the distribution can be normalized by division by N_min,
as indicated in Fig. 4.2.






The term frequency of use distribution is a commonly available
DRS statistic, and a preferred data source for the estimation process.
In order to formally describe the term frequency of use distribution
two hypotheses will be offered:

I. The term-frequency-of-use versus rank distribution is a
   decreasing concave (convex) function in ordinal (log-log) space,
   and is closely approximated by the Mandelbrot-Estoup-Zipf (MEZ)
   distribution.

The Mandelbrot-Estoup-Zipf (94, 95, 96, 153) relationship is
defined for the distribution of word frequency in an unrestricted language
in which the relative probability of occurrence of a word or term is
defined to be

    \[ P_i = K (r_i + B)^{-\alpha} \]

where r_i = the rank of term i that is used N_i times
      P_i = probability of occurrence of the term i (with rank r_i)
      K   = e raised to a power [exponent illegible in source];
            derived from the exponential law for optimum codes;
            for this application it is a constant to be determined
            empirically
      α   = [ratio illegible in source]; also a constant to be
            determined empirically.

The basic form of the MEZ canonical form is illustrated in log-log
space in Fig. 4.4. For comparison, the more specific Zipf's Law (a
special case of the MEZ form) is also indicated.
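
A minimal sketch of the MEZ form follows. The report states that K and
α are to be determined empirically; purely for illustration, K is
chosen here so that the probabilities over the ranked terms sum to one,
and the function name and parameter values are hypothetical.

```python
def mez_probabilities(num_terms, alpha, b):
    """Return (K, [P_1, ..., P_T]) with P_i = K * (r_i + B) ** -alpha
    for ranks r_i = 1 .. num_terms.

    K is set by normalization so the probabilities sum to one; this is
    an assumption of the sketch, not a fitting procedure.
    """
    weights = [(rank + b) ** -alpha for rank in range(1, num_terms + 1)]
    k = 1.0 / sum(weights)
    return k, [k * w for w in weights]


# Zipf's Law is the special case B = 0, alpha = 1.
k_zipf, zipf = mez_probabilities(num_terms=375, alpha=1.0, b=0.0)
k_mez, mez = mez_probabilities(num_terms=375, alpha=1.2, b=2.5)
```

In the Zipf case the first-ranked term is twice as probable as the
second; a positive B flattens the head of the curve, which is the
difference between the two curves the text refers to in Fig. 4.4.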

The term-frequency of use distribution is a representation of the
column marginals of the DXT distribution. Taking the row marginals of
the DXT matrix yields the depth of indexing distribution. A typical
depth of indexing distribution is illustrated in Fig. 4.5. This
distribution displays the assignment of terms to documents, and will be
referred to again in a later chapter as a source of data to define the
degree of homogeneity of a DRS corpus.

Fig. 4.4 -- Term frequency of occurrence versus rank in
            log-log space (horizontal axis: log r)

Fig. 4.5 -- Typical depth of indexing distribution (axes: frequency
            of occurrence versus depth of indexing)

An additional DRS characteristic, and one of central relevance 
to the Rq measure, is the term-term (TXT) correlation matrix. This 
matrix is defined as follows: 

    \[ (\mathrm{DXT})^{T} \, \mathrm{DXT} = \mathrm{TXT} \]

The element TXT(i,j) represents the number of co-occurrences of
terms i and j. For example, if terms i and j were assigned to the same
n documents, the value of TXT(i,j) would be n. Another way of defining
the elements of TXT is that they are the inner product of the i-th
column vector of DXT with the j-th column vector of DXT. Also, the matrix
TXT is symmetric.
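
The relationship between DXT, its marginals and TXT can be sketched on
a small hypothetical binary matrix (four documents, three terms). The
function names and the data are illustrative only.

```python
def column_marginals(dxt):
    """Term frequency of use: number of documents carrying each term."""
    return [sum(column) for column in zip(*dxt)]


def row_marginals(dxt):
    """Depth of indexing: number of terms assigned to each document."""
    return [sum(row) for row in dxt]


def term_term_matrix(dxt):
    """TXT = (DXT)^T DXT; entry (i, j) is the inner product of the
    i-th and j-th columns of the binary document-term matrix."""
    num_terms = len(dxt[0])
    return [[sum(row[i] * row[j] for row in dxt) for j in range(num_terms)]
            for i in range(num_terms)]


# Rows are documents, columns are terms (hypothetical data).
dxt = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]
txt = term_term_matrix(dxt)
```

The diagonal of TXT reproduces the term frequencies of use, and the
off-diagonal entries are the co-occurrence counts used in the rest of
this chapter.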

Having defined the TXT distribution, the second hypothesis about 
the term frequency of use distribution can be made: 

II. The term-term co-occurrence distribution is not generated 
by a process which selects terms for assignment independent 
of one another. 

Since the Rq measure emphasizes quantity, a third hypothesis is
of interest, and is also based on the data in the TXT distribution:

III. Terms with the same frequency of use have essentially the
     same statistical characteristics in the TXT distribution.






4.3.2 Inquiry Definition and Generation 

The process of inquiry generation is initiated by the system user, 
who upon "experiencing" a need for information, converts that need 
into a "natural language" request, and then interprets (with or without 
the aid of the DRS personnel) the request into a formal DRS inquiry. 
A formal inquiry is defined as consisting of terms from the system 
thesaurus that are coordinated in accordance with the system grammar. 
The rules of coordination to be used in this analysis are defined by
Boolean Algebra. The explicit operations used for term coordination
are: union or logical sum (+), intersection or logical product (·),
and exclusion or logical negation (-).

The pertinent characteristics of the inquiry are the form (the
number of terms and operators by type) and the frequency of use of the
terms. The semantic characteristics of the inquiry are not used in
the Rq determination, as it is assumed that terms with the same
frequency of occurrence have essentially the same term-term co-occurrence
characteristics (Hypothesis III above). This assumption, which is
proven in the next chapter, simplifies the inquiry generation process
for developing hypothetic DRS workloads for DRSs in the design stage.

4.3.3 Inquiry -- Retrieval Quantity Measure Relationship

The basic variables relating inquiry terms to documents retrieved 

are: 

(1) term-frequency of use (f(i)) 

(2) term-term co-occurrence values (TXT(i,j)) 






For the logical operators "+", "·" and "-", the following
relationships hold for elementary two-term inquiries:

    Request              Inquiry    Output Quantity

    T_i                  i          f(i)
    T_i and T_j          i·j        TXT(i,j)
    T_i or T_j           i+j        f(i) + f(j) - TXT(i,j)
    T_i and not T_j      i-j        f(i) - TXT(i,j)

Therefore, for all elementary two-term inquiries, knowledge of the
term frequencies of use and their co-occurrence value is sufficient
to determine the output quantity. For more complex inquiries in which
many terms are coordinated the determination of Rq is not so simple.
It follows, however, that if the single terms in the above example were
replaced by groups of, say, conjunctively related terms, the same
relationships would hold. For example, given groups E and F with output
quantities O_E and O_F and a logical product O_EF, the following is true:

    Inquiry    Output Quantity

    E·F        O_EF
    E+F        O_E + O_F - O_EF
    E-F        O_E - O_EF
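
The two-term relationships can be checked against explicit posting
sets. In the sketch below the function name and the document numbers
are hypothetical; the three formulas are those tabulated above.

```python
def two_term_output(f_i, f_j, txt_ij):
    """Output quantities for elementary two-term inquiries, given the
    term frequencies of use f(i), f(j) and the co-occurrence TXT(i,j)."""
    return {
        "and": txt_ij,                 # i·j
        "or": f_i + f_j - txt_ij,      # i+j
        "and_not": f_i - txt_ij,       # i-j
    }


# Hypothetical posting lists: the documents indexed by each term.
postings_i = {1, 2, 3, 5}
postings_j = {2, 3, 7}

quantities = two_term_output(
    f_i=len(postings_i),
    f_j=len(postings_j),
    txt_ij=len(postings_i & postings_j),
)
```

The same function applies unchanged when i and j stand for groups E and
F, provided O_E, O_F and O_EF are supplied in place of the term values.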



The above relationships hold for sequences of disjunctively
related groups of conjunctively coordinated terms, or for conjunctively
related sequences of disjunctively coordinated terms. It can be shown
that any retrieval specification (in the propositional or predicate
calculus) on the set of thesaurus terms can be represented in
disjunctive or conjunctive normal form. A disjunctive normal form is a
disjunction of clauses with no repetition of terms within the clauses.
A clause is simply a finite conjunction of terms (where negation is
defined as a negative conjunction). Also every disjunctive normal
form has a dual conjunctive normal form (53, 103, 108, 109). Thus, no
matter how complex, an inquiry can be converted to a string of clauses
that can be evaluated for quantity output, as per the relationships
in the above example. The crucial value to determine is the logical
product.
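
As an illustration of evaluating an inquiry in disjunctive normal form,
the sketch below represents each clause as a pair of term lists
(required, excluded) and evaluates it by direct match against posting
sets. The representation, the function name and the data are
assumptions made for this example, not the report's notation.

```python
def evaluate_dnf(clauses, postings, all_docs):
    """Return the set of documents retrieved by a DNF inquiry.

    clauses  : list of (required_terms, excluded_terms) pairs; each
               pair is one conjunctive clause.
    postings : dict mapping a term to the set of documents it indexes.
    all_docs : set of every document number in the corpus.
    """
    retrieved = set()
    for required, excluded in clauses:
        matched = set(all_docs)
        for term in required:
            matched &= postings[term]      # logical product
        for term in excluded:
            matched -= postings[term]      # exclusion
        retrieved |= matched               # logical sum over clauses
    return retrieved


postings = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 5}}
all_docs = {1, 2, 3, 4, 5, 6}

# Inquiry (A·B) + (C - A), written as two clauses.
inquiry = [(["A", "B"], []), (["C"], ["A"])]
output_quantity = len(evaluate_dnf(inquiry, postings, all_docs))
```

Direct evaluation of this kind needs the full postings; the estimation
methods that follow aim to predict the same quantity from the term
frequencies and logical products alone.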

4.4 HYPOTHESES FOR RETRIEVAL QUANTITY ESTIMATIONS 

Given the search strategy of direct match, two methods of
estimating the logical product of inquiry terms, and the value of Rq, are
discussed in this section.

The problem of determining Rq for a multiterm inquiry is
illustrated by the following example. For an n term disjunctively
coordinated inquiry, T_1+T_2+...+T_n, the estimate of Rq is:

    \[ R_q = f(1) + f(2) + \cdots + f(n) - \text{Logical Product}_{(1,2,\ldots,n)} \]

The simplest model for estimating the logical product of two or
more terms is one that assumes that the descriptor assignment to a
document is a random assignment. This case has been noted as being
basically incorrect; however, it can be employed as a stepping stone to
an eventual solution. For this model, the logical product of two or
more terms is:

    \[ \text{Logical Product}_{(1,2,\ldots,n)} = \frac{f(1) \cdot f(2) \cdots f(n)}{D^{\,n-1}} \]

and R'_q for an n term disjunctive inquiry, T_i+T_j+...+T_n, is:

    \[ R'_q = f(i) + f(j) + \cdots + f(n) - \frac{f(i) \cdot f(j) \cdots f(n)}{D^{\,n-1}} \]

It can be shown that the actual value of the logical product and
the "random-case" values do diverge significantly.* However, if one
makes the hypothesis:

IV. There exists a stable statistical relationship between the
    actual term-term distribution and the hypothetical "random
    case" distribution,

then the above formulation yielding R'_q can be modified to yield
an accurate estimator of Rq. From the above hypothesis the proposed
modification is:

    \[ R_q = \gamma R'_q \]

or what is equivalent

    \[ \text{Logical Product}_{(1,2,\ldots,n)} = \gamma_{1,2,\ldots,n} \left( \frac{f(1) \cdot f(2) \cdots f(n)}{D^{\,n-1}} \right) \]

This hypothesis will be tested for acceptance or rejection in
Chapter 5. If the hypothesis is accepted then a very convenient method
for estimating Rq will be available.

* A statistical test of an actual DRS is performed in Chapter 5 to
demonstrate that the distribution of logical products of terms is not
equivalent to a "random-distribution."
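
A minimal sketch of the random-case model and its proportional
correction is given below. The proportion gamma is applied to the
logical product (the second of the two forms stated above); the
function names, the sample frequencies and the value of gamma are
hypothetical, since the actual proportion is to be determined from the
experiments of Chapter 5.

```python
from math import prod


def random_case_logical_product(frequencies, num_docs):
    """f(1) * f(2) * ... * f(n) / D ** (n - 1): the expected number of
    documents carrying all n terms if terms were assigned at random."""
    return prod(frequencies) / num_docs ** (len(frequencies) - 1)


def disjunctive_estimate(frequencies, num_docs, gamma=1.0):
    """Sum of the term frequencies less the (scaled) logical product.

    gamma = 1.0 gives the uncorrected random-case value R'q; any other
    value is an assumed correction proportion.
    """
    scaled = gamma * random_case_logical_product(frequencies, num_docs)
    return sum(frequencies) - scaled


# Two terms used 30 and 20 times in a 300-document corpus.
uncorrected = disjunctive_estimate([30, 20], num_docs=300)
corrected = disjunctive_estimate([30, 20], num_docs=300, gamma=3.0)
```

A gamma above one corresponds to terms that co-occur more often than
chance, which lowers the estimated size of a disjunctive response.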



67 



Given that the proportion γ proves to be acceptable, the proposed
utilization of the proportion for multi-term inquiries is illustrated
in the following example:

    Inquiry: T_1·T_2·T_3

    Estimation of Rq: [equation illegible in source]



A second model for estimating the logical product of two or more
terms can be constructed by using the row (MR) marginals and column
(MC) marginals, and the total sum (TS) of marginals for the TXT matrix.
For this model, the expected value of the logical product of two or more
terms is:

    \[ \text{Logical Product}_{(1,2,\ldots,n)} = \sum \frac{(MR_i)(MC_j)}{TS} \]

where MR_i = sum of the term co-occurrences in row i -- for term i
             with terms 1, ..., T
      MC_j = sum of the term co-occurrences in column j -- for term j
             with terms 1, ..., T
      TS   = \sum_{k=1}^{T} MR_k = \sum_{k=1}^{T} MC_k

and Rq for an n term disjunctively coordinated inquiry, T_i+T_j+...+T_n,
is:

    \[ R_q = f(i) + f(j) + \cdots + f(n) - \sum \frac{(MR_i)(MC_j)}{TS} \]

where an analogous hypothesis, to model 1, is

    [equation illegible in source]

or what is equivalent

    \[ \text{Logical Product}_{(1,2,\ldots,n)} = \lambda_{1,2,\ldots,n} \sum \frac{MR_i \cdot MC_j}{TS} \]

Using experimental data in Chapter 5, the above relationships will
be tested to determine if they can be accepted or rejected for use as an
operational tool.
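
For a single pair of terms, the marginal-based expectation can be
sketched as follows. The function name and the small TXT matrix are
hypothetical, and the sketch assumes the marginals are taken over the
full matrix including its diagonal, which the text does not specify.

```python
def marginal_expected_cooccurrence(txt, i, j):
    """Return MR_i * MC_j / TS for terms i and j of a square TXT matrix.

    MR_i is the sum of row i, MC_j the sum of column j, and TS the sum
    of all entries (assumption: the diagonal is included).
    """
    row_sum = sum(txt[i])
    column_sum = sum(row[j] for row in txt)
    total = sum(sum(row) for row in txt)
    return row_sum * column_sum / total


# Hypothetical symmetric term-term matrix for three terms.
txt = [
    [3, 2, 2],
    [2, 3, 1],
    [2, 1, 2],
]
expected_01 = marginal_expected_cooccurrence(txt, 0, 1)
```

Comparing such expected values with the observed entries of TXT is one
way to examine whether a stable proportion of the kind hypothesized
above exists.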






All the business of life is to endeavor to find out
what you don't know by what you do.

                              The Duke of Wellington

Chapter 5

THE RETRIEVAL QUANTITY MEASURE: EXPERIMENTS AND RESULTS

5.1 INTRODUCTION 

The purpose of this chapter is to analyze the various hypotheses
made thus far in this report about the fundamental characteristics
and relationships of coordinate-index DRSs, and to construct and test
an operational Rq estimation model for systems that are established
or in the design stage.

In the preceding chapter the following hypotheses about DRSs were
stated:

(1) the term-frequency-of-use versus term rank distribution is
    a monotonically decreasing concave function in log-log space,
    and is closely approximated by the M-E-Z canonical form.

(2) the term-term co-occurrence distribution is not generated by
    a process which selects terms for assignment independent of
    one another; that is to say, the term co-occurrence
    distribution is not the result of random sampling from the
    thesaurus.

(3) the co-occurrence value of two terms is directly proportional
    to a function of the frequencies of use for the terms, and
    can be predicted as a function of that factor.

(4) terms with the same frequency of use have essentially the
    same statistical characteristics. That is, two terms i and
    j with frequencies of use f(i) = f(j) will have
    approximately the same number of co-occurrences with other terms
    in the thesaurus.

(5) the Retrieval Quantity (Rq) of a coordinate index DRS can
    be predicted for formal inquiries.

One of the principal aims of this chapter is the analysis of these
hypotheses for acceptance or rejection. The required experiments
and analyses for this task and for the construction of the Rq model
are discussed next.

5.2 EXPERIMENTS: SETTING AND DESCRIPTION

Experiments for the analysis of the above hypotheses were
performed at the Institute of Library Research Information Processing
Laboratory at the University of California, Berkeley, California.
At the time of the experiments, the Laboratory facilities consisted
of three Sanders CRT remote on-line terminals to an IBM 360, Model 40,
128K system. The CRTs had keyboard input and visual display output,
and were capable of simultaneous operation.

The Laboratory system was equipped with three search grammars,
and eight word association files (including direct match search
capability).

The experiments were set to take place over a period of time in
which the Laboratory DRS corpus and thesaurus were expanded. The
original plan called for a three-stage growth sequence, but only the
first and second stages were realized. The system characteristics for
the two stages are tabulated in Table 5.1, and the term-frequency of
use versus term rank distribution for the system, at the end of stage
two, is shown in Fig. 5.1. Samples of the system thesaurus and
term-document assignments are included in Appendix B. The DRS corpus is
composed exclusively of documents on information science, and can be
appropriately classified as being homogeneous. For a more complete
description of the Laboratory and its research projects see Maron, et
al. (98).

Table 5.1
ILR DOCUMENT RETRIEVAL SYSTEM

    Characteristics              Stage 1              Stage 2

    Corpus (documents)           300                  400
    Thesaurus (terms)            368 (348 active)     393 (375 active)
    Average depth of indexing    14                   12-13
    Average term usage           3-4                  3-4

Table 5.2
DATA BASE SAMPLE

    Characteristics

    Corpus                       102
    Thesaurus                    320 (307 active)
    Average depth of indexing    14
    Average term usage           3-4

5.2.1 Experiments and Analysis 

The data collection and analysis involved several steps. The
first consisted of gathering the DRS responses, over the two stages
of system growth, for a set of formal inquiries. The second step
entailed an analysis of a data sample from the DRS term-document
distribution,* and the third, the evaluation of the retrieval quantity
model. In the next two sections, 5.3 and 5.4, all these steps are
discussed in detail, and the hypotheses are analyzed for acceptance or
rejection.

5.3 DOCUMENT RETRIEVAL SYSTEMS - COMMON CHARACTERISTICS

In this section, the issues of statistical regularity among co- 
ordinate indexed DRSs, and the data analysis which demonstrates the 
statistical similarity of the test system to other DRSs, of different 
size and subject matter, are discussed. 

*See Appendix C for a description of the data sample.

A number of researchers, Brookes (20), Fairthorne (49), Mandelbrot
(94, 95, 96), to mention a few, have observed that there are certain
statistical regularities common to a variety of documentation systems
and activities. Fairthorne (49), in fact, presents a brief survey of
this topic.* All of these findings revolve around the concept that
the underlying behavior of DRSs is "hyperbolic" in nature (49).

Of interest to this analysis are the characteristics of derived-
manipulative indexed DRSs that exhibit similar properties, independent
of system size and subject matter. The basic relationship for
DRSs is the index term-document distribution, from which all the
term-term and document-document functional relationships can be
derived. Therefore, if the term-document distributions of different DRSs
can be shown to be statistically similar, or definable by an analytic/
canonical form, the argument for statistical regularity among DRSs can
be accepted. The principal vehicle for showing this is the
term-frequency-of-use distribution.

5.3.1 The Term-Frequency-of-Use Distribution 

The preferred characteristic to use to determine if there is a
statistical similarity among DRSs is the term-document (TXD)
distribution. However, the TXD distribution is awkward to deal with and is
rarely published. Thus the strategy taken is to use surrogate
distributions; namely, the term-frequency-of-use versus term rank, the
term usage versus the cumulative-frequency distribution, and the depth
of indexing distribution. The first two distributions, in particular,
are readily available from published research and all three
distributions are convenient to illustrate. The relationships between these
distributions and the TXD matrix are illustrated in Fig. 5.2.

*A richer but unfortunately abstruse discussion is given by
Mandelbrot (94-96).



[Figure 5.2: schematic of the documents-by-terms matrix DXT(i,j) and
the distributions derived from it: frequency of occurrence versus depth
of indexing, term frequency of usage versus term rank, log(term usage)
versus log rank, and log(term usage) versus cumulative distribution of
utilization of thesaurus.]

Fig. 5.2 - Illustration of the relationships between the term-document
matrix and the term frequency of use vs. rank distribution, the
term usage vs. cumulative usage distribution, and the
depth of indexing distribution



Figure 5.3 shows in log-log space the term frequency distribution
for the test system sample, the test system, and the three larger DRSs
investigated by Litofsky (90). All the curves are concave, monotonically
decreasing relationships. The two DRSs investigated by A. D. Little
(1) are shown in Fig. 5.4, and these systems also display the same
concave, monotonically decreasing term frequency of use versus rank in
log-log space. It is important to note that these systems differ greatly
in size, and have different subjects for corpus content.

In addition, Houston and Wall (68) and Wall (143) have analyzed
some 14 DRSs and plotted their term-frequency of use versus the
cumulative percent of thesaurus utilization.* Their plots are reproduced
in Figs. 5.5 and 5.6. All the systems plotted exhibit a remarkable
linearity for the postings per term versus the cumulative
distribution, which led Houston and Wall to conclude that the number of terms
T in a system vocabulary varies directly with the log of TU, the total
number of term uses, and has the form:

$$T = a \log_{10}(TU + b) - c$$

where a = 3300
      b = 10000
      c = 12600

for values of TU between 10,000 and 1,000,000. As further evidence
of statistical regularity, the three systems analyzed by Litofsky (90)
and the ILR systems are plotted in the Houston-Wall dimensions. These

*Fairthorne (49) points out that the two methods of illustration
are just different ways of showing the same characteristics.



[Figure 5.4: log-log plot; vertical scale 10 to 100,000, horizontal
axis rank, 10 to 1000.]

Fig. 5.4 - Term frequency of use versus rank for a 10 percent
sample of the Industrial Collection System investigated
by A. D. Little (1, 2)



Fig. 5.5 - Term usage versus cumulative utilization of
thesaurus for systems investigated by Houston and Wall (68)
(see Table 5.3 for systems corresponding to numbered
curves)

[Figure 5.6: horizontal axis P(x), cumulative distribution: fraction
of terms used x or fewer times.]

Fig. 5.6 - Term usage versus cumulative utilization of
thesaurus for systems investigated by Wall (143)
(see Table 5.3 for systems corresponding
to numbered curves)

plots are also linear and are shown in Figs. 5.7 and 5.8. The above
relationship holds quite nicely for the keyword files analyzed by
Litofsky. The ILR system, however, is too small, as its TU is < 10,000,
and the above constants require adjustment; the form of the
relationship, however, is satisfied.
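
The Houston-Wall relationship is simple to evaluate numerically. The
following Python sketch is an illustration only: the function name
`vocabulary_size` and its keyword arguments are assumptions introduced
here, while the constants a, b and c are the values quoted above. The
form is stated only for TU between 10,000 and 1,000,000, so the sketch
should not be read as applying to a system as small as the ILR one.

```python
import math

# Constants quoted for 10,000 <= TU <= 1,000,000 (Houston and Wall).
A = 3300.0
B = 10000.0
C = 12600.0


def vocabulary_size(total_term_uses, a=A, b=B, c=C):
    """Estimate T, the number of vocabulary terms, from TU, the total
    number of term uses, using T = a * log10(TU + b) - c."""
    return a * math.log10(total_term_uses + b) - c


# Evaluate the form at the ends and the middle of its stated range.
for tu in (10_000, 100_000, 1_000_000):
    print(tu, round(vocabulary_size(tu)))
```

Because T grows with the logarithm of TU, a hundredfold increase in
term uses raises the estimated vocabulary by only a few thousand terms.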

This empirical evidence is even more impressive when one compares 
the range in corpus and thesaurus size, the different subjects covered, 
and the variation in index term utilization. These pertinent system 
characteristics are tabulated in Table 5.3. 

5.3.2 The Term-Frequency-of-Use Canonical Form

In addition to the graphical interpretation, which implies strong 
statistical stability, a number of efforts have been made to define 
the term-frequency-of-use versus rank relationship analytically. 

The best-known attempt to define in equation form a general
relationship between term frequency of occurrence and term rank is by
Zipf (152), who suggested the form:

$$f(r) \cdot r = K$$

where K = a constant for a particular (large) sample of text in any
          language
      f(r) = the frequency of occurrence of the term with rank r
      r = term rank; a positive integer.

This expression is based on empirical observation of free or running
text, and as noted by Mandelbrot (95) and Fairthorne (49), it is
an extension of the earlier work of J. B. Estoup in 1916 and J. Willis
in 1922. Mandelbrot (94, 95) using communication or information theory



[Figure 5.7: horizontal axis cumulative distribution, percent of
keywords having f or fewer occurrences.]

Fig. 5.7 - Log-probability plot of keyword distribution (90)

Fig. 5.8 - Log probability of descriptor usage, ILR test system

[Table 5.3: characteristics of the document retrieval systems compared;
the rotated table is not recoverable from the scan.]



as a basis has derived a relationship, between word frequency of use
and the rank of a word, that is more general than Zipf's, and of which
Zipf's is a special case. Because of the various contributors, this
relationship will be referred to as the Mandelbrot-Estoup-Zipf (MEZ)
distribution, and has the form:

$$f(r) = K (r + B)^{-\alpha}$$

For B = 0 and $\alpha$ = 1, the above relationship reduces to Zipf's "Law."
However, Zipf's equation calls for a linear plot of slope minus one in
log-log space, which is not satisfied (even with congruent intercepts) by
the curves plotted in Figs. 5.3 and 5.4.

For the purposes of this analysis it will be sufficient to show that
the MEZ canonical form is close to the actual term-frequency of use
versus term rank distribution. To illustrate how the parameters K, B
and $\alpha$ are defined for a DRS (at a certain point in time), the test
system characteristics will be used. For the test DRS:

      D = 102
      T = 370
      T' = 307 (the number of active terms in the thesaurus)
      $\bar{D}$ = 14 (the average depth of indexing)
      f(r = 1) = 32 (the frequency of use of the term with rank = 1)
      f(r = 300) = 1 (the frequency of use of a term with rank of about 300)

Zipf [see Booth (10)] has noted that a term will occur once if

$$1.5 > T\,P(r) \ge 0.5$$

where P(r) = the probability of occurrence of a term with rank r

$$P(r) = \frac{f(r)}{\sum_{r=1}^{T'} f(r)}$$

      T = the total number of term occurrences

$$T = \sum_{r=1}^{T'} f(r)$$

The above relationship can be generalized for a term occurring n times:

$$(n + 1/2) > T\,P(n) \ge (n - 1/2)$$

Substituting the MEZ form for P(n) yields

$$(n + 1/2) > T\,K'(r + B)^{-\alpha} \ge (n - 1/2)$$

For a term with the highest rank, $r \approx T'$, and where $B \ll T'$ (which
is always the case; see Mandelbrot (95)), and n = f(T') = 1, the
inequality becomes:

$$1.5 > T\,K'(r)^{-\alpha} \ge 0.5$$

Because the condition of interest is $r_{max}$, only the right side of the
inequality need be used. Therefore,

$$T\,K'(r_{max})^{-\alpha} = 0.5$$

solving for K' yields

$$K' = \frac{0.5\,(r_{max})^{\alpha}}{T}$$

Thus, given the number of different or active thesaurus terms, T',
and the total number of term occurrences, T, one can estimate K' by
assuming an $\alpha$, or estimate $\alpha$ assuming a K'. According to Booth (10),
Zipf (153), and Mandelbrot (93-95), $\alpha \approx 1$. Since more is known about
the range of $\alpha$ than K and all that is needed is a "quick" approximation,
an $\alpha \approx 1$ will be used. With $\alpha$ = 1,

$$K' = \frac{(0.5)(307)}{T} \approx 0.1$$

Note, if f(r) instead of P(r) were being estimated, then, since
f(r) = T P(r),

$$K = T\,K' \approx 150$$

With $\alpha$ and K estimated, the next step is to determine B.

The simplest way to estimate B is at the intercept f(r = 1),
where B is obviously not negligible because r = 1. Solving

$$f(r) = K(r + B)^{-\alpha}$$

for B yields

$$B = \left(\frac{K}{f(r)}\right)^{1/\alpha} - r$$

For K = 150, f(r) = 30, r = 1 and $\alpha \approx 1$, the estimate for B is 4 to
4.5, depending on whether $\alpha$ = 1 or 0.9, respectively.
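
The "quick" estimation of K', K and B can be written out as a short
Python sketch. It is illustrative only. The function names are
introduced here, and the total number of term occurrences T is not
stated explicitly for the test sample, so the sketch assumes T is
roughly the number of documents times the average depth of indexing
(102 x 14); that assumption is marked in the code.

```python
def estimate_k_prime(active_terms, total_occurrences, alpha=1.0):
    """K' from T * K' * (T')**(-alpha) = 0.5, i.e. the condition that
    the term of highest rank (r = T') occurs once."""
    return 0.5 * active_terms ** alpha / total_occurrences


def estimate_b(k, f_r, alpha=1.0, rank=1):
    """Solve f(r) = K * (r + B)**(-alpha) for B at the given rank."""
    return (k / f_r) ** (1.0 / alpha) - rank


ACTIVE_TERMS = 307  # T' for the test sample
# ASSUMPTION: T ~ documents * average depth of indexing = 102 * 14.
TOTAL_OCCURRENCES = 102 * 14

k_prime = estimate_k_prime(ACTIVE_TERMS, TOTAL_OCCURRENCES)
k = TOTAL_OCCURRENCES * k_prime      # K = T * K', since f(r) = T * P(r)
b = estimate_b(150.0, 30.0, alpha=1.0)

print(round(k_prime, 2), round(k, 1), round(b, 1))
```

Under the stated assumption K' comes out near 0.1 and K near 150, in
line with the rounded values used in the text; `estimate_b` can be
re-run with other values of alpha to see how sensitive B is to that
choice.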

The comparison of MEZ values and the actual term frequency-rank
distribution, for the test sample, is shown in Fig. 5.9 and tabulated
in Table 5.4.



[Figure 5.9: log-log plot of f(r) versus term rank (1 to 100), showing
the ILR data and the MEZ curve $f(r) = K(r + B)^{-\alpha}$ with K = 150,
B = 4, $\alpha$ = 0.9.]

Fig. 5.9 - Comparison of ILR test sample term frequency of use
versus rank distribution with the canonical form



Table 5.4

COMPARISON OF MEZ VALUES WITH ACTUAL TERM USAGE
VERSUS RANK VALUES FOR THE TEST SAMPLE

Rank    ILR Test Sample    MEZ Value*

  1           32              35.2
  2           27              29.9
  3           26              26.0
  4           24              23.1
  5           22              20.8
  6           21              18.9
  7           20              17.3
  8           17              16.0
  9           16              14.9
 10           15              14.0
 11           14              13.1
 12           13              12.4
 13           12              11.7
 14           11              11.1
 15           10              10.6

*K = 150; B = 4.5; $\alpha$ = 0.9.



On the basis of this empirical evidence, the hypothesis that the
term-usage versus rank relationships are closely approximated by the
MEZ canonical form is accepted.
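
The comparison in Table 5.4 can be re-computed directly from the MEZ
form. The Python sketch below is an illustration, not the program used
for the report; the function name `mez` is introduced here. It uses
K = 150, $\alpha$ = 0.9 and B = 4, the values shown in the legend of Fig. 5.9.
With those values the tabulated MEZ column is recovered to one decimal
place for the ranks checked, whereas the table footnote lists B = 4.5.

```python
# Observed term frequencies of use for ranks 1..15 (Table 5.4).
OBSERVED = [32, 27, 26, 24, 22, 21, 20, 17, 16, 15, 14, 13, 12, 11, 10]


def mez(rank, k=150.0, b=4.0, alpha=0.9):
    """Mandelbrot-Estoup-Zipf form f(r) = K * (r + B)**(-alpha)."""
    return k * (rank + b) ** (-alpha)


# (rank, observed value, fitted value) for each tabulated rank.
rows = [(r, obs, mez(r)) for r, obs in enumerate(OBSERVED, start=1)]

for r, obs, fit in rows:
    print(f"{r:>4} {obs:>8} {fit:>8.1f} {fit - obs:>+7.1f}")
```

The last printed column is the difference between the fitted and the
observed value, which gives a quick view of where the canonical form
over- or under-estimates the sample.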

5.3.3 Depth of Indexing Distribution 

The depth of indexing distribution is an additional DRS characteristic
that can be used to determine statistical similarities between
DRSs. The distribution is derived from the DXT distribution (it is
the distribution of the row marginals) as indicated in Fig. 5.2.






The indexing density distributions for the test system and the
two keyword systems employed by Litofsky (90) are shown in Figs. 5.10
and 5.11 respectively. As for the term usage versus rank distribution,
it would be very desirable to represent the depth of indexing
distribution by a canonical form. While this exercise is not carried out
here, a suggested canonical form is noted in Chapter 7.

5.3.4 The Term-Term Co-occurrence Distribution

The term-term (TXT) matrix is derived from the DXT matrix as shown 
in Fig. 5.12. For the test system, the TXT matrix is quite sparse 
(=82 percent). The non-zero integer entries indicate the number of 
instances in which the two terms, defining the intersection, are used 
as common or co-descriptors for documents in the corpus. 
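
The derivation of the TXT matrix from a binary DXT matrix can be
sketched in a few lines of Python. The example is hypothetical: the
4-document, 4-term matrix and the function name `term_term_matrix` are
invented for illustration and are not taken from the test system. The
sketch forms the product of the transposed DXT matrix with itself, so
each off-diagonal entry counts the documents sharing two terms and each
diagonal entry equals the frequency of use of one term.

```python
def term_term_matrix(dxt):
    """Return TXT = transpose(DXT) * DXT for a binary matrix whose rows
    are documents and whose columns are terms."""
    n_terms = len(dxt[0])
    txt = [[0] * n_terms for _ in range(n_terms)]
    for row in dxt:
        assigned = [t for t, flag in enumerate(row) if flag]
        # Every pair of terms assigned to this document co-occurs once.
        for i in assigned:
            for j in assigned:
                txt[i][j] += 1
    return txt


# Hypothetical corpus: 4 documents indexed with 4 terms.
DXT = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]

TXT = term_term_matrix(DXT)
zero_fraction = sum(v == 0 for row in TXT for v in row) / len(TXT) ** 2

for row in TXT:
    print(row)
print("fraction of zero cells:", zero_fraction)
```

The same row-by-row pass also yields the depth of indexing (the number
of terms assigned per document), which is the row marginal of DXT.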

Three hypotheses have been put forward regarding the character- 
istics of the TXT matrix. Each hypothesis will be stated and then 
analyzed. The first case is: 

5.3.4.1 Term Independency. The TXT matrix is not generated by 
a process which selects terms for assignment independent of one another. 

A prevalent assumption in previous analyses is that the descriptor 
terms in the system thesaurus are assigned independent of one another 
to documents in the corpus. The often stated qualification is that 
while this assumption of independency is not exactly satisfied, it 
is a reasonable approximation. It does not appear that this assumption 
has ever been statistically tested. Perhaps a complicating factor 
is that the convenient chi-square test for goodness of fit is not 



[Figure 5.11: distribution plotted against depth of indexing, 15 to 50.]

Fig. 5.11 - Depth of indexing distribution for
the systems investigated by Litofsky (90)



appropriate in this case. This is so because the DXT matrix, as
defined for the DRSs of interest, is binary and very sparse (i.e., a
matrix condition in which the number of elements whose value is zero equals
or exceeds the number of elements whose value is non-zero). Thus the
theoretical limitation of the chi-square test, which requires that
the expected value of the sample of population elements to be tested
must be at least equal to 5, is not satisfied. Hence, the chi-square
test cannot be used to statistically ascertain whether the DXT matrix
is or is not generated as though the descriptors are assigned
independent of one another. This situation also holds for the TXT matrix.
Even though there are TXT(i,j) which exceed 5, there are many
elements that do not because the TXT matrix is also sparse;* this
necessarily follows because

$$TXT = (DXT)^{T} \cdot (DXT)$$

and DXT is sparse.

In lieu of the chi-square, the test elected to apply to accept
or reject the hypothesis is called the "Generalized-Likelihood-Ratio
Test" (see Mood and Graybill (102)). The Generalized-Likelihood Ratio
(GLR) is defined as the quotient

$$\lambda = \frac{L(\hat{s})}{L(\hat{\omega})}$$

where $L(\hat{s})$ = the maximum of the likelihood function in the sample
                region or space s, with respect to the parameters
      $L(\hat{\omega})$ = the maximum of the likelihood function in the population
                region or space $\omega$, with respect to the parameters

and $-2 \log \lambda$ is defined as a chi-square variate.

*However, it is easily shown that TXT is never more sparse than DXT.

The null hypothesis $H_0$ of interest is that the descriptor terms are
assigned independent of one another for each document-descriptor set.
When $H_0$ is true, $-2 \log \lambda$ is approximately distributed as chi-square
with N degrees of freedom when M is large. Thus the null hypothesis
can be tested by computing $-2 \log \lambda$ and comparing it with the desired
level of significance of chi square. If $-2 \log \lambda$ exceeds the chi-square
level, $H_0$ will be rejected, otherwise $H_0$ will be accepted.

Given the DXT matrix, as illustrated in Fig. 5.13, the desire is
to show that the assignment of any one of the terms in the matrix is
independent of the occurrence of any other term; that is to say, the
probability of occurrence of term i is independent of term j. The
null hypothesis is:

$$H_0:\quad L = \prod_{i=1}^{N} p_i^{\,n_i}\, q_i^{\,M - n_i}$$

where $p_i$ = probability of term i occurring $n_i$ times
      $q_i$ = $1 - p_i$
      M = the number of documents to be indexed
      N = the number of terms in the thesaurus

Fig. 5.13 - Illustration of a sparse DXT matrix with statistical
parameters employed in the GLR test

To test $H_0$, the GLR $\lambda$ is computed, where

$$\lambda = \frac{L(\hat{s})}{L(\hat{\omega})}$$

and

$$L(\hat{s}) = \sup_{p_1,\ldots,p_N} \prod_{i=1}^{N} p_i^{\,n_i} (1 - p_i)^{M - n_i}
            = \prod_{i=1}^{N} \left(\frac{n_i}{M}\right)^{n_i} \left(1 - \frac{n_i}{M}\right)^{M - n_i}$$

where, in the context of this problem, $n_i/M$ are normalized frequencies*
and are taken to be sufficiently representative of the probabilities
of occurrence of the descriptor terms. Also,

$$\hat{P} = \text{vector of } (P_1, \ldots, P_N)$$

which maximizes the function L(s); in this case, the empirically
observed frequencies of occurrence, or the "best estimates" of the
elements of P.

*The implicit assumption is that a term can be assigned only once
to a document. Therefore, the maximum frequency of use of any term is
the number of documents in the corpus, M.

Introducing Logs for ease of computation yields

$$\log L(\hat{s}) = \sum_{i=1}^{N} \left[ n_i \log\frac{n_i}{M} + (M - n_i) \log\left(1 - \frac{n_i}{M}\right) \right]$$

where the normalized frequencies, $f_i = n_i/M$, can be substituted, giving

$$\log L(\hat{s}) = \sum_{i=1}^{N} \left[ n_i \log f_i + (M - n_i) \log(1 - f_i) \right]$$

Now it is necessary to compute $L(\hat{\omega})$:

$$L(\hat{\omega}) = \sup_{P} \prod_{i_1,\ldots,i_N} \left(P_{i_1 \ldots i_N}\right)^{n(i_1,\ldots,i_N)}$$

where $P_{i_1 \ldots i_N}$ is the probability that a randomly chosen document
has descriptor vector $n(i_1,\ldots,i_N)$ and is defined by

$$\hat{P}_{i_1 \ldots i_N} = \frac{n(i_1,\ldots,i_N)}{M}$$

Substituting, and introducing the Log for convenience yields,

$$\log L(\hat{\omega}) = \sum_{i_1,\ldots,i_N} n(i_1,\ldots,i_N) \log\left(\frac{n(i_1,\ldots,i_N)}{M}\right)$$

Assuming that the identical occurrence of $n(i_1,\ldots,i_N)$ for more
than a few documents is not a very likely event,* then $\log L(\hat{\omega})$ can be
simplified as follows

$$\log L(\hat{\omega}) = \sum_{j=1}^{K} j\,d(j) \log\left(\frac{j}{M}\right); \quad \text{for } j < M$$

where K is the maximum number of congruent document vectors, and d(j)
is the number of descriptor vectors which correspond to exactly j
documents. In fact, in the usual case (of which the test system is an
example), K = 1, and the above relationship reduces to

$$\log L(\hat{\omega}) = M \log\frac{1}{M}$$

*The most unlikely event is when the identical occurrences of
$n(i_1,\ldots,i_N)$ is M, which means the corpus consists of M "identical"
items, in so far as the thesaurus subject delineation of concepts/
subjects is concerned.

Therefore, the expression to be evaluated is:

$$\log \lambda = \sum_{i=1}^{N} \left[ n_i \log f_i + (M - n_i) \log(1 - f_i) \right] - M \log\frac{1}{M}$$

and $-2 \log \lambda$ is the chi square variate of interest with N degrees of
freedom.* Since, for this analysis, N = 370, the normal approximation
to the chi square distribution is used.

For the test sample, $-2 \log \lambda$ = 850, which is larger than the
normal approximation to the chi square, which at the .005 level = 480.
Therefore, the hypothesis of term independency is rejected.
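
The statistic can be computed directly from a binary DXT matrix. The
following Python sketch is a minimal illustration under stated
assumptions, not the computation performed for the report: it covers
only the case K = 1 (every document has a distinct descriptor vector),
it uses natural logarithms, and it treats terms used in no document or
in every document as contributing zero. The function name
`glr_statistic` and the small matrices are hypothetical, and the sketch
makes no attempt to reproduce the value of 850 quoted for the test
sample.

```python
import math


def glr_statistic(dxt):
    """Return -2 * (log L(s) - log L(omega)) for a binary matrix whose
    rows are documents and whose columns are terms (K = 1 case)."""
    m = len(dxt)
    # ASSUMPTION: K = 1, so no two documents share a descriptor vector.
    if len({tuple(row) for row in dxt}) != m:
        raise ValueError("descriptor vectors must all be distinct (K = 1)")

    log_l_s = 0.0
    for t in range(len(dxt[0])):
        n_t = sum(row[t] for row in dxt)      # frequency of use of term t
        if 0 < n_t < m:                       # f = 0 or 1 contributes zero
            f_t = n_t / m                     # normalized frequency
            log_l_s += n_t * math.log(f_t) + (m - n_t) * math.log(1.0 - f_t)

    log_l_omega = m * math.log(1.0 / m)       # M * Log(1/M)
    return -2.0 * (log_l_s - log_l_omega)


# Two terms assigned in every possible combination: exactly independent.
print(glr_statistic([[0, 0], [0, 1], [1, 0], [1, 1]]))
# Two terms that never co-occur: the statistic is positive.
print(glr_statistic([[1, 0], [0, 1]]))
```

The value returned would then be compared with the chi-square level for
N degrees of freedom, as described above.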

5.3.4.2 Term-Term Co-occurrence Factor. The next hypothesis to
test is whether the co-occurrence of two terms is directly proportional
to a function of the frequencies of use of the terms.

In Chapter 4, two candidate functions were proposed:

$$\text{I.}\quad TXT(i,j) = \gamma \left( \frac{f(i) \cdot f(j)}{D} \right)$$

$$\text{II.}\quad TXT(i,j) = \gamma_1 \left( \frac{RS(i) \cdot CS(j)}{\sum_i RS(i)} \right)$$

where f(i) = the frequency of use of term i
      TXT(i,j) = the value of the intersection of term i and j
      RS(i) = the sum of the entries in row i
      CS(j) = the sum of the entries in column j
      D = the number of documents indexed.

The relationships of the above functions and variables and the TXT
matrix are shown in Fig. 5.12.

The variables of interest in the above equations are the $\gamma$'s.
That is, in order for the estimations to be useful, the distribution
*The variable $p_j$ is allowed to vary over the range 0 to 1, with
j = 1, ..., N.



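of values for $\gamma$ must be stable and stationary. Therefore the forms of
the relationships that will be analyzed are:

$$\text{I}'.\quad \gamma = \frac{TXT(i,j)}{f(i) \cdot f(j) / D} = \frac{\text{Actual}}{\text{Theoretical}}$$

$$\text{II}'.\quad \gamma_1 = \frac{TXT(i,j)}{RS(i) \cdot CS(j) / \sum_i RS(i)} = \frac{\text{Actual}}{\text{Theoretical}}$$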

A computer program was written to analyze a sample of the test
DRS TXD distribution. The program generated the TXT(i,j) for every
non-zero cell in TXT, computed the values of the candidate functions,
and the ratio of the actual to theoretical values for $\gamma$ and $\gamma_1$. A small
sample is presented in Table 5.5. It is clearly evident that relationship
I or I' is superior to relationship II or II'. Function II is very
unstable (it has a large variance) and it is not suitable as an estimator
of the value of TXT(i,j).
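
A present-day equivalent of such a program might look like the Python
sketch below. It is a hypothetical illustration, not the original
program: the function name `gamma_ratios`, the small TXT matrix and the
frequencies are invented, and the row and column sums are taken over
the matrix exactly as supplied (diagonal included), which is an
assumption about how RS and CS were formed.

```python
def gamma_ratios(txt, freqs, n_docs):
    """For every non-zero cell TXT(i,j) with i < j, return a tuple
    (i, j, ratio_I, ratio_II) of actual-to-theoretical ratios."""
    row_sums = [sum(row) for row in txt]
    col_sums = [sum(col) for col in zip(*txt)]
    total = sum(row_sums)
    results = []
    for i, row in enumerate(txt):
        for j, actual in enumerate(row):
            if i >= j or actual == 0:
                continue
            theoretical_1 = freqs[i] * freqs[j] / n_docs        # function I
            theoretical_2 = row_sums[i] * col_sums[j] / total   # function II
            results.append((i, j, actual / theoretical_1,
                            actual / theoretical_2))
    return results


# Hypothetical 4-term example with D = 4 documents.
TXT = [[3, 2, 1, 0],
       [2, 2, 0, 0],
       [1, 0, 1, 0],
       [0, 0, 0, 1]]
FREQS = [3, 2, 1, 1]

RATIOS = gamma_ratios(TXT, FREQS, n_docs=4)
for i, j, ratio_1, ratio_2 in RATIOS:
    print(i, j, round(ratio_1, 3), round(ratio_2, 3))
```

Collecting the two ratio columns over a full sample and comparing
their spread is one way to judge which candidate function is the more
stable estimator.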

On the other hand, function I is very stable. The plot of
theoretical $\gamma$ versus f(i) in log-log space is always linear, and all the
theoretical values of $\gamma$ for any f(i) can be determined from a knowledge
of the relationship of f(i) and the $\gamma$'s for f(i). An illustration of
this relationship is given in Fig. 5.14.

The empirical values of $\gamma$ for terms with f(i) = 1 to f(i) = 32*
are plotted in Figs. 5.15 to 5.30. As shown, each occurrence or value
of $\gamma$ either falls on the theoretical lower bound or lies above it on

*The highest term frequency of use in the data sample.



[Table 5.5: sample of TXT(i,j) values with the ratios of actual to
theoretical values for functions I and II; the rotated table is not
recoverable from the scan.]

[Figs. 5.15 to 5.30: log-log plots of $\gamma$ against f(i), term frequency
of use (1 to 100). Captions recovered:]

Fig. 5.15 - $\gamma$ factors for f(j) = 1, 2, 3, 4 and 1 ≤ f(i) ≤ 32
Fig. 5.19 - $\gamma$ factors for f(j) = 6 and 1 ≤ f(i) ≤ 32
Fig. 5.20 - $\gamma$ factors for f(j) = 7 and 1 ≤ f(i) ≤ 32
Fig. 5.23 - $\gamma$ factors for f(j) = 10 and 1 ≤ f(i) ≤ 32
Fig. 5.26 - $\gamma$ factors for f(j) = 13 and 1 ≤ f(i) ≤ 32
Fig. 5.27 - $\gamma$ factors for f(j) = 15 and 1 ≤ f(i) ≤ 32
Fig. 5.29 - $\gamma$ factors for f(j) = 26 and 1 ≤ f(i) ≤ 32



a curve that is an integer multiple of the lower bound value. It is
always the case that the theoretical minimum value of $\gamma$ is the lower
bound, and that whenever there is a difference between the lower bound
and the actual, the actual value is always an integer multiple of the
lower bound. For f(i) ≤ 5, the dispersion of $\gamma$ values is small, and
increases for 5 ≤ f(i) ≤ 32.

In an attempt to assess the distribution of the $\gamma$-factor values,
plots of the cumulative distribution of occurrence versus the ratio
of $\gamma$ actual to $\gamma$ theoretical minimum were prepared,* and are presented
in Figs. 5.31 to 5.37. For terms with a high frequency of use, it is
necessary to introduce a weighting factor, which as shown in the next
section is a stable and well behaved factor. At this point, sufficient
evidence has been accumulated (Table 5.5, and Figs. 5.15 to 5.37) to
satisfy the hypothesis that the term-term co-occurrences are definable
as a function of the term frequencies of use and are directly
proportionate to that factor.
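
The integer-multiple property follows directly from form I', under the
assumption (made here, and consistent with the values $\gamma$ = 1.46 =
102/(7 x 10) and $\gamma$ = 1.02 = 102/(20 x 5) quoted in Section 5.4.1) that
the theoretical lower bound is the value of $\gamma$ for a single
co-occurrence, TXT(i,j) = 1. The subscripts "actual" and "min" are
labels introduced for this note:

```latex
\gamma_{\mathrm{actual}}(i,j) = \frac{TXT(i,j)\, D}{f(i)\, f(j)},
\qquad
\gamma_{\min}(i,j) = \frac{D}{f(i)\, f(j)}
\quad \bigl(TXT(i,j) = 1\bigr)

\frac{\gamma_{\mathrm{actual}}(i,j)}{\gamma_{\min}(i,j)} = TXT(i,j) \in \{1, 2, 3, \ldots\}
```

Since the analysis is restricted to non-zero TXT(i,j), the ratio of
actual to minimum $\gamma$ is simply the co-occurrence count itself, which is
why the plotted values fall on integer multiples of the lower bound.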

5.4 THE RETRIEVAL QUANTITY MEASURE 

As described previously, the Retrieval Quantity ($R_q$) measure
indicates the quantity of documents (references) that are output by a
DRS in response to a formal inquiry. The purpose of this section is
to develop an operational form of such a measure, and to test the
measure with a set of actual inquiries on an operational system.

The procedure for predicting $R_q$ for an inquiry entails several
steps:

*Note this analysis is restricted to TXT(i,j) > 0.

[Figs. 5.31 to 5.37: cumulative frequency plotted against the ratio
actual $\gamma$ / theoretical $\gamma$ (0 to 10).]

Fig. 5.31 - Cumulative frequency of the ratio of actual $\gamma$
to theoretical $\gamma$ for terms with frequency of use of 3
co-occurring with terms with frequency of use of 1 to 3

Fig. 5.32 - Cumulative frequency of the ratio of actual $\gamma$
to theoretical $\gamma$ for terms with frequency of use of 4
co-occurring with terms with frequency of use of 1 to 4

Fig. 5.33 - Cumulative frequency of the ratio of actual $\gamma$
to theoretical $\gamma$ for terms with frequency of use of 5
co-occurring with terms with frequency of use of 1 to 5

Fig. 5.34 - Cumulative frequency of the ratio of actual $\gamma$
to theoretical $\gamma$ for terms with frequency of use
of 10 co-occurring with terms with
frequency of use of 1 to 10

Fig. 5.35 - Cumulative frequency of the ratio of actual $\gamma$
to theoretical $\gamma$ for terms with frequency of use
of 15 co-occurring with terms with
frequency of use of 1 to 15

Fig. 5.36 - Cumulative frequency of the ratio of actual $\gamma$
to theoretical $\gamma$ for terms with frequency of use
of 20 co-occurring with terms with frequency
of use of 1 to 10

Fig. 5.37 - Cumulative frequency of the ratio of actual $\gamma$
to theoretical $\gamma$ for terms with frequency of use of 32
co-occurring with terms with frequency of use of 1 to 10



(1) construction of the formal inquiry (from the user request)

(2) application of the term co-occurrence factor, $\gamma$

(3) determination of $R_q$

Step (1) has been discussed in Chapter 4, and steps (2) and (3) will
be analyzed in this section.

5.4.1 Application of the Term Co-Occurrence Factor,
$\gamma$, and Determination of $R_q$

Step (2) involves the application of $\gamma$ to the explicit conjunctive
arguments, and implicit intersections of the disjunctive arguments in the
inquiries. Taking a simple example such as $T_1 \cdot T_2$, for which $R_q$ is the
term co-occurrence value (TXT(1,2)), the lower bound estimate of $R_q$
is:

$$R_q = \gamma \left( \frac{f(1) \cdot f(2)}{D} \right)$$

$\gamma$ is found by using the appropriate plot of $\gamma$ and the term frequencies
of use (e.g., plots like Figs. 5.31 to 5.37), and the variables f(1),
f(2) and D are readily determined for any operational system.

A few examples will help to illustrate the estimation procedure:

1. Request: Retrieve all those documents that discuss the
   concept of Coordinate Indexing

   Formal Inquiry: Concept and Coordinate Indexing

From Appendix C, the frequencies of use of each inquiry term, in
the sample data are:

      f(Concept) = 7
      f(Coordinate Index) = 10
      D = 102

From Fig. 5.20, for f(i) = 10, $\gamma$ = 1.46, and

$$R_q = (1.46)\left(\frac{7 \cdot 10}{102}\right) = 1$$

which is exactly correct for the sample data base.

2. Request: Retrieve all the documents that discuss
   classification and clumping

   Formal Inquiry: Classification and clump

From Appendix C, the frequencies of use for each term are:

      f(classification) = 20
      f(clump) = 5

From Fig. 5.28, for f(i) = 20, $\gamma$ = 1.02 and the theoretical lower
bound estimate of $R_q$ is:

$$R_q = (1.02)\left(\frac{20 \cdot 5}{102}\right) = 1$$

which is less than the actual number (3) of documents described by the
two terms in the sample data base.
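
The two examples can be checked with a few lines of Python. This is a
sketch for illustration; the function names are introduced here. The
helper `gamma_lower_bound` assumes that the lower-bound $\gamma$ read from the
plots is the value for a single co-occurrence, D / (f(i) f(j)); that
assumption is not stated in this form in the text, but it returns the
two $\gamma$ values quoted in the examples.

```python
def gamma_lower_bound(f_i, f_j, n_docs):
    """ASSUMED form of the lower-bound factor: the gamma obtained when
    the two terms co-occur in exactly one document."""
    return n_docs / (f_i * f_j)


def rq_estimate(gamma, f_i, f_j, n_docs):
    """Two-term estimate: Rq = gamma * f(i) * f(j) / D."""
    return gamma * f_i * f_j / n_docs


D = 102  # documents in the sample data base

# Example 1: Concept (f = 7) and Coordinate Indexing (f = 10).
example_1 = rq_estimate(1.46, 7, 10, D)
# Example 2: classification (f = 20) and clump (f = 5).
example_2 = rq_estimate(1.02, 20, 5, D)

print(round(gamma_lower_bound(7, 10, D), 2), round(example_1, 2))
print(round(gamma_lower_bound(20, 5, D), 2), round(example_2, 2))
```

With the lower-bound factor the two-term estimate is always close to
one document, which is why a correction is needed whenever the actual
co-occurrence is larger.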

These examples show that for combinations of low frequency of use 
terms the lower bound theoretical y-f actor leads to accurate R estl- 
mates, but tends to diverge from TXT(1 ,j) as f(1) and/or f(j) Increases. 
However, when the lower bound y value causes the R^ estimate to be less 
than the actual value, the difference or correction Is always an Inte- 
ger multiple of y. 

One way to correct for the underestimation for large f(i) is to
employ a simple weighting scheme; that is, to apply weights
(probabilities) to integer multiples of the lower-bound γ, with the
weights reflecting the proportion or frequency of occurrence of the
values of TXT(i,j) for the terms i and j of interest. For example, the
estimate for an n-term conjunction would be

$$R_q = (1 + \alpha_1)\,\gamma\left(\frac{f(i)\,f(j)}{D}\right) + \alpha_2 \cdot 2\gamma(\cdot) + \dots + \alpha_n \cdot n\gamma(\cdot)$$

where n = [f(i), f(j)]_min and the α_k can be estimated from plots of
the cumulative frequency of the ratio of actual to theoretical γ's, as
in Figs. 5.31 to 5.37, or from the cumulative distribution of the
values of term co-occurrences, such as in Fig. 5.38, or the density
distribution of the values of the term co-occurrence, as in Figs. 5.39
and 5.40.

For example 2 above, the corrections are determined from Fig.
5.40, for f(i) = 20 and f(j) = 5:

    α_1 = 0.51
    α_2 = 0.21
    α_3 = 0.10
    α_4 = 0.06
    α_5 = 0.04

The R_q estimate is now:

$$R_q = \left[1 + (.51)(1) + (.21)(2) + (.1)(3) + (.06)(4) + (.04)(5)\right] \cdot 1.02\left(\frac{20 \cdot 5}{102}\right) \approx 2.7$$

which is a much better estimate of the actual value of 3. The
distribution of the α_k's is quite stable, and in Section 5.4.2 they
are incorporated into the γ versus f(i) plot (see Fig. 5.42).
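The weighting scheme can be written out as a short calculation. The following is only a sketch under the reading given above (weights α_1 to α_n applied to integer multiples of the lower-bound estimate); the function name `weighted_rq` and its argument names are illustrative and do not come from the report.

```python
def weighted_rq(gamma_lb, f_i, f_j, corpus_size, weights):
    """Weighted R_q: (1 + sum_k k * a_k) * gamma * f(i) * f(j) / D.

    `weights` holds a_1 .. a_n, the weights read from the density
    distribution of TXT(i, j) for the term frequencies of interest.
    """
    base = gamma_lb * f_i * f_j / corpus_size
    multiplier = 1.0 + sum(k * a_k for k, a_k in enumerate(weights, start=1))
    return multiplier * base


# Example 2: classification (f = 20) and clump (f = 5), D = 102.
alphas = [0.51, 0.21, 0.10, 0.06, 0.04]
print(round(weighted_rq(1.02, 20, 5, 102, alphas), 2))
```

With these inputs the sketch evaluates to about 2.67, against an actual value of 3 and an uncorrected lower-bound estimate of 1.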

Fig. 5.38. Cumulative frequency of occurrence of TXT(i,j) for terms
with f(i) = 1, 2, 3, 5, 10, 15, 20, 26 and 32, and 1 ≤ f(j) ≤ 32; only
non-zero co-occurrences plotted. (Abscissa: TXT(i,j), term
co-occurrence value.)

Fig. 5.39. Density distribution of occurrence of TXT(i,j) for terms
with f(i) = 1, 2, 3, 5, 10 and 15, and 1 ≤ f(j) ≤ 32; only non-zero
TXT(i,j) plotted. (Abscissa: TXT(i,j), term co-occurrence value.)

Fig. 5.40. Density distribution of occurrence of TXT(i,j) for terms
with f(i) = 20, 26 and 32, and 1 ≤ f(j) ≤ 32; only for non-zero
TXT(i,j).

Unlike the above simple examples, most requests are a string of
conjunctively and disjunctively related terms, and in general the
string will contain more than two terms. When more than two terms are
included in an inquiry, the estimation of R_q requires an iterative
procedure. For example, consider the following inquiry:

    T_1 and T_2 and T_3 and ... and T_n

To estimate R_q one must:

(1) determine f(T_1), ..., f(T_n)

(2) determine γ for f(T_1) and f(T_2), and the theoretical value
    of TXT(T_1,T_2) by γ·(f(T_1)·f(T_2)/D); this value is the
    intersect of T_1 and T_2

(3) call the intersect of T_1 and T_2, T_1', and determine the
    intersect of T_1' and T_3, as per step (2)

(4) repeat steps (2) and (3) until the intersect of T_{n-1}' and
    T_n is determined; this is the R_q estimate for the n-term
    conjunctive series

In the event that a request contains one or more disjunctions,
the above iterative procedure is modified as follows. Consider an
inquiry of the form:

    (T_1 or T_2) AND (T_3 or T_4)

To estimate R_q, recall that

    R_q(T_1 + T_2) = f(T_1) + f(T_2) - TXT(T_1,T_2)

and incorporate this relationship in the iterative procedure:

(1) determine f(i), for i = T_1, T_2, T_3 and T_4

(2) determine γ for f(T_1) and f(T_2), and the theoretical value of
    TXT(T_1,T_2) by γ·(f(T_1)·f(T_2)/D); this is the intersect of
    T_1 and T_2, and therefore

$$R_q(T_1 + T_2) = f(T_1) + f(T_2) - \gamma\,\frac{f(T_1)\,f(T_2)}{D}$$

(3) repeat step (2) for all other disjunctive pairs.

(4) when all the disjunctive groups have been reduced to their
    "net" respective T_i's, the remaining expression is simply a
    conjunctive series and the R_q estimate is determined as for
    the previous example.

At times an inquiry will contain an explicit negation of a term,
such as in the following example:

    T_1 AND NOT T_2

To estimate R_q, an additional modification of the above procedure is
required. Recall that

    R_q(T_1 - T_2) = f(T_1) - TXT(T_1,T_2)

yields the net R_q. Therefore, for those clauses in which there is
a negated term, the above relationship is determined, and the resulting
net T_i is used to compute the remaining conjunctions and/or
disjunctions of terms.

Having established an iterative procedure to estimate quantity
output for complex inquiries, the next step is the evaluation of the
R_q estimation process.
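The iterative procedure for conjunctions, disjunctions and negations can be sketched as a left-to-right fold over "net" term frequencies. This is a minimal illustration, not the report's program: the function names are hypothetical, the default γ is a stand-in equal to the theoretical lower bound D/(f(a)·f(b)) floored at 1 (in practice γ would be read from the best-estimate curves), and the cap on the intersect is a guard added here.

```python
from functools import reduce


def gamma_lower_bound(f_a, f_b, corpus_size):
    """Stand-in gamma: lower bound D / (f_a * f_b), floored at 1 (assumption)."""
    return max(1.0, corpus_size / (f_a * f_b))


def intersect(f_a, f_b, corpus_size, gamma=gamma_lower_bound):
    """Estimated TXT of two (possibly net) terms: gamma * f_a * f_b / D."""
    if f_a <= 0 or f_b <= 0:
        return 0.0
    estimate = gamma(f_a, f_b, corpus_size) * f_a * f_b / corpus_size
    # Guard added for this sketch: an intersect cannot exceed either term.
    return min(estimate, f_a, f_b)


def conjunction(freqs, corpus_size, gamma=gamma_lower_bound):
    """T1 and T2 and ... and Tn: fold the pairwise intersect left to right."""
    return reduce(lambda net, f: intersect(net, f, corpus_size, gamma), freqs)


def disjunction(freqs, corpus_size, gamma=gamma_lower_bound):
    """T1 or T2 or ...: f(a) + f(b) - TXT(a, b), folded left to right."""
    return reduce(
        lambda net, f: net + f - intersect(net, f, corpus_size, gamma), freqs
    )


def and_not(f_a, f_b, corpus_size, gamma=gamma_lower_bound):
    """T1 and not T2: f(T1) - TXT(T1, T2)."""
    return f_a - intersect(f_a, f_b, corpus_size, gamma)


# Hypothetical inquiry (T1 or T2) and (T3 or T4) in a corpus of D = 416.
D = 416
left = disjunction([20, 10], D)
right = disjunction([9, 3], D)
print(conjunction([left, right], D))
```

Passing a different `gamma` callable (for example a table lookup built from measured curves) changes the estimates without touching the Boolean bookkeeping.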

5.4.2 Testing the Rq Estimate 

The data and illustrations presented thus far reflect the sample
data, and it is necessary to extend the findings to the test system
to evaluate the R_q estimate. In order to do this, certain logical
properties of the relationship

$$TXT(i,j) = \gamma\left(\frac{f(i)\,f(j)}{D}\right)$$

must be established.

As demonstrated, the above relationship is linear with slope of -1
in log-log space.* Further, for any DRS, all curves for any
combination of f(i) and f(j) are derivable from the theoretical curve
for f(i) = 1 and 1 ≤ f(j) ≤ D. To show this, the first step is to
determine the intercept for the curve f(i) = 1 and 1 ≤ f(j) ≤ D.

The ordinal intercept for f(i) = 1 and 1 ≤ f(j) ≤ D is defined at
f(i) = 1 and f(j) = 1, which yields

$$TXT(i,j) = 1 = \gamma\left(\frac{1 \cdot 1}{D}\right)$$

or

$$\gamma = D$$

which is the value of the intercept on the γ-axis. The intercept on
the f(j) axis for the curve f(i) = 1, 1 ≤ f(j) ≤ D can be determined
in a similar manner. Setting f(i) = 1 and f(j) = D yields

$$TXT(i,j) = 1 = \gamma\left(\frac{1 \cdot D}{D}\right)$$

or

$$\gamma = 1.$$

Therefore, all one needs to know to establish the value of the
intercepts for the curve f(i) = 1 and 1 ≤ f(j) ≤ D is the size D of
the system corpus, and that the term usage versus rank distribution is
approximated by the MEZ canonical form.

* For the theoretical lower bound.

The curve just determined is the lower and upper bound for all
values of γ for terms with f(i) = 1, and 1 ≤ f(j) ≤ D. In addition,
this curve is the upper bound on the values of γ for all other
combinations of term frequency; that is, for

    1 < f(i) ≤ D
    1 ≤ f(j) ≤ D

Further, on the basis of the above curve for f(i) = 1, and 1 ≤ f(j) ≤ D,
the theoretical lower bound values of γ for all other combinations of
f(i) and f(j) can be determined. The procedure to determine these
lower bound curves is illustrated in Fig. 5.41, for the test corpus
with D = 416, and f(j) = 10, and consists of the following steps:

(1) locate f(j) = 10 on the abscissa (point I in Fig. 5.41).

(2) follow the vertical line up to the intersection (point II)
    with the line for f(i) = 1, 1 ≤ f(j) ≤ 416.

(3) follow the horizontal to the ordinate intercept (point III),
    which gives the value of γ for f(i) = 1 and f(j) = 10.

(4) trace the 45° line, with slope -1, to its intercept with
    the abscissa, at γ = 1 (point IV).

Fig. 5.41. γ factor versus f(j) for ILR document retrieval system,
stage II

The resulting line between points III and IV, and extrapolated
beyond, is the theoretical lower bound for γ for f(j) = 10, and
1 ≤ f(i) ≤ D', where D' < 416 - f(i). That is, if one were to
estimate the intersect, TXT(i,j), of two terms with f(i) = 10 and
f(j) = 416, respectively, it is clear that TXT(i,j) = 10, by
definition. Therefore, in order that the theoretical lower bound
curve satisfy that condition, it must be asymptotic to the line for
γ = 1 and intercept that line in the vicinity of f(j) = 416.
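The construction of the lower-bound curves can be checked numerically. The sketch below assumes that the slope -1 line in log-log space corresponds to γ = D/(f(i)·f(j)), and it replaces the asymptotic approach to γ = 1 with a hard floor at 1; both the function name and that simplification are assumptions made for illustration.

```python
def gamma_lower_bound(f_i, f_j, corpus_size):
    """Theoretical lower-bound gamma: D / (f(i) * f(j)), never below 1."""
    if not (1 <= f_i <= corpus_size and 1 <= f_j <= corpus_size):
        raise ValueError("term frequencies must lie between 1 and D")
    return max(1.0, corpus_size / (f_i * f_j))


D = 416
print(gamma_lower_bound(1, 1, D))     # ordinate intercept of the f(i) = 1 curve
print(gamma_lower_bound(1, 10, D))    # point III for f(j) = 10
print(gamma_lower_bound(10, 416, D))  # gamma for f(i) = 10, f(j) = D
```

The same function reproduces the γ values quoted for the 102-document sample: about 1.46 for frequencies 10 and 7, and 1.02 for frequencies 20 and 5.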

Given the basis for constructing the theoretical envelope, and its
bounds, of γ values, the next step is to determine the best estimate
values of the γ factor between the upper and lower bounds, for the test
system. The best estimate values of γ can be determined using the
following assumptions about, and properties of, coordinate index DRSs.

(1) the sample data base is representative of the parent or test
    system, and the divergence data indicated in Figs. 5.31 to
    5.40 can be extrapolated to the value of f(i)_max in the test
    system.

(2) the upper bound of the γ-curves for any term is defined by
    the curve of slope (-1) for f(i) = 1 and 1 ≤ f(j) ≤ D, in
    log-log space.

(3) the lower bound of the γ-curve for any term j is defined
    by the curve with an ordinal intercept defined by the
    intersection of f(j) with the curve for f(i) = 1, 1 ≤ f(j) ≤ D,
    and an asymptote to γ = 1 in the vicinity of D', where D' =
    D - f(j).

(4) for any two terms, the value of the γ factor must be the same,
    regardless of the sequence of determination; that is, the
    curves must possess a symmetry such that

$$\gamma_{f(i),f(j)} = \gamma_{f(j),f(i)}.$$

    This property follows from the fact that TXT is symmetric;
    i.e., TXT(i,j) = TXT(j,i).



Using the above assumptions and properties, the best estimate γ
factor curves for the test system were derived and are presented in
Fig. 5.42. From property (3), one would expect that the test system
γ curves would be asymptotic to the line γ = 1 for f(i) = D.
However, there is instead an apparent convergence of curves at
3 ≤ γ ≤ 5, for high f(i). Since the data sample had very few points in
the range of 30 < f(i) < D, it was not possible to analyze this
characteristic in depth. However, it is likely that the reason for this
property is that the test system is small (D ≈ 400 and T = 400) and,
as the product f(i)·f(j) approaches or exceeds D, the intersection
of the two terms is going to be substantial; hence the convergence
of γ-curves for high f(i) (but ≪ D) at γ > 1.

In order to evaluate the R_q estimation process, based on the
γ-curves in Fig. 5.42, a set of 15 requests of various content was
generated. The requests are considered to be typical and corpus
subject related, and are not based on the descriptions of any one
document or set of documents.* The test inquiries are listed in
Table 5.6.

The R_q values, both estimated and actual, for each inquiry were
determined for direct match searches and are reported in Table 5.7.**
The estimated values are, for all inquiries, very close to the actual
R_q, and clearly demonstrate that the Retrieval Quantity for an
operational coordinate index DRS can be accurately predicted for formal
inquiries.

* The intent was to avoid the early Cranfield (see Ref. 130) or
"Moore's" type inquiry, in which requests are generated from document
descriptor sets. Such inquiries test the system retrieval search
linkages, but are certainly not representative of the typical user
request.

** Some sample computations are included in Appendix D.


Table 5.6
TEST INQUIRIES

     Inquiry                                  Form                     Term frequency
                                                                       f(T_1), f(T_2), ...

 1.  Auto. indexing and auto. abstracting     (illegible)              (illegible)
     and (theory or analysis or experiment)
     and not manual indexing
 2.  Comp. linguistics and syntax and         T1 · T2 · T3             (illegible)
     semantic
 3.  Natural language and (auto. indexing     T1 · (T2+T3) · T4        (illegible)
     or auto. abstracting) and experiments
 4.  STAT association and (clump or           T1 · (T2+T3) · T4        (illegible)
     cluster) and experiment
 5.  Automatic and indexing and (coordinate   T1 · T2 · (T3+T4)        (illegible)
     or subject heading)
 6.  Measure and relevance and evaluation     T1 · T2 · T3 · (T4+T5)   (illegible)
     and (theory or performance)
 7.  Simulation and (retrieval or info.       T1 · (T2+T3+T4)          (illegible)
     retrieval or document)
 8.  Theory and (documentation or info.       T1 · (T2+T3)             20, 10, 84
     retrieval)
 9.  Design and retrieval system and          T1 · T2 · (T3+T4)        9, 15, 3, 1
     (on-line or real-time)
10.  Design and automatic and retrieval       T1 · T2 · T3             9, 28, (illegible)
     system
11.  Computer and education and (design or    T1 · T2 · (T3+T4)        69, 15, 9, 44
     evaluation)
12.  Question and evaluation and (Boolean     T1 · T2 · (T3+T4)        33, 44, 13, 4
     or logical)
13.  Depth-of-indexing and (evaluation or     T1 · (T2+T3)             8, 44, 53
     analysis)
14.  Natural language and translation         T1 · T2                  38, 31
15.  Abstracting and centers and controlled   T1 · T2 · T3             13, 5, (illegible)



Table 5.7

COMPARISON OF ACTUAL AND ESTIMATED
R_q FOR DIRECT MATCH SEARCHES

    Inquiry    R_q-Actual    R_q-Estimate
       1           2            1-2 (a)
       2           2            1-2
       3           2            3-4
       4           0            1-2
       5           2            4
       6           0            3
       7           2            3
       8          13            15
       9           9            0-1
      10           1            1-2
      11           3            4
      12           1            2-3
      13           6            5-6
      14          12            12
      15           1            0-1

(a) The R_q estimate is frequently a non-integer value and the ranges
indicated are integer bounds.



5.5 THE LIKELIHOOD OF NON-ZERO TERM-TERM CO-OCCURRENCES 

The analysis and results presented thus far have implicitly assumed
that the probability of term-term co-occurrences for terms with f(i),
f(j) > 0 (for actual inquiry combinations for a homogeneous corpus) is
significantly greater than zero. Thus the γ factors presented in Fig.
5.42 can be viewed as the values to estimate TXT(i,j), given that
f(i), f(j) > 0 and that terms i and j do indeed co-occur. Since the
DXT matrix is usually very sparse (for the test data sample
approximately 95 percent of the cells are zero), and the TXT matrix is
also usually sparse* (for the test data sample, approximately 82
percent of the cells are zero), some insight into the behavior of

    P(TXT(i,j) > 0 | f(i), f(j) > 0)

as a function of f(i), f(j), and the number of terms with the same
frequency of use is desired.

* It can be shown that the sparsity of TXT is always less than or
equal to the sparsity of DXT, where TXT = (DXT)^T (DXT) and
DXT(i,j) ≥ 0 for all i, j.

The theoretical probability, based on independent term usage, that
the co-occurrence of two terms is greater than zero, given that each
term has a frequency of use greater than zero, can be determined as
follows:

Given:  D documents = {d}
        T terms (active) = {t}
        i_t = the frequency of use of term t; 1 ≤ i_t ≤ D
        j_i = the number of terms with frequency of use i; 1 ≤ j_i ≤ D
              (e.g., j_3 = the number of terms with i_t = 3)

where:

$$\sum_{m=1}^{k} j_m = T$$

$$\sum_{t=1}^{T} i_t = \sum_{m=1}^{k} m\, j_m = N$$

        N = the total term frequency of occurrences

For this analysis, one may specify an initial distribution for
j_i and then, for all the terms {t}, select i_t documents at random
and without replacement and use the terms to describe the documents.
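The random-assignment model just described can be simulated directly. The sketch below is illustrative only: it builds a 0/1 document-by-term matrix (DXT) by giving each term its i_t documents at random without replacement, and forms TXT as the product of the transposed DXT with DXT. The function names, the seed, and the small frequency list are hypothetical.

```python
import random


def random_dxt(term_frequencies, n_docs, seed=0):
    """Assign each term t to i_t distinct documents chosen at random."""
    rng = random.Random(seed)
    dxt = [[0] * len(term_frequencies) for _ in range(n_docs)]
    for t, i_t in enumerate(term_frequencies):
        for d in rng.sample(range(n_docs), i_t):
            dxt[d][t] = 1
    return dxt


def cooccurrence(dxt):
    """TXT = DXT^T DXT: entry (a, b) counts documents holding both terms."""
    n_terms = len(dxt[0])
    return [
        [sum(row[a] * row[b] for row in dxt) for b in range(n_terms)]
        for a in range(n_terms)
    ]


freqs = [1, 1, 2, 3, 5]          # hypothetical i_t values
dxt = random_dxt(freqs, n_docs=20)
txt = cooccurrence(dxt)
print([txt[t][t] for t in range(len(freqs))])  # diagonal equals the i_t
```

The diagonal of TXT recovers each term's frequency of use, and the off-diagonal cells are the co-occurrence counts whose probability of being non-zero is examined next.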

For computational convenience the probability of non-occurrence,
P(TXT(t_a,t_b) = 0), will be determined, and then P(TXT(t_a,t_b) > 0)
= 1 - P(·). A general condition on P is that:*

    P = 0 for i_{t_a} + i_{t_b} > D

For the case in which i_{t_a} + i_{t_b} ≤ D, the simplest situation is
where only one term is used i_{t_a} times and only one term i_{t_b}
times; that is, j_{i_{t_a}} = j_{i_{t_b}} = 1. For notational
convenience, let

    x = i_{t_a}
    y = i_{t_b}
    Q_{xy}(0) = P(TXT(t_a,t_b) = 0)

* This constraint is necessary because any one term can be assigned
to any one document only once.

For this case:

$$Q_{xy}(0) = \frac{D-x}{D} \cdot \frac{D-x-1}{D-1} \cdots \frac{D-x-y+1}{D-y+1} = \frac{(D-x)!\,(D-y)!}{(D-x-y)!\,D!}$$
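For two single terms the no-co-occurrence probability is a hypergeometric ratio, which is easy to evaluate. The sketch below assumes the random-assignment model stated above (each term's documents drawn without replacement, independently of the other term); the function name is illustrative.

```python
from math import comb, factorial


def prob_no_cooccurrence(x, y, n_docs):
    """Q_xy(0): chance that y random documents avoid the x documents of
    the other term, i.e. C(D - x, y) / C(D, y)."""
    if x + y > n_docs:
        return 0.0  # the two document sets must overlap
    return comb(n_docs - x, y) / comb(n_docs, y)


D, x, y = 10, 2, 3
q = prob_no_cooccurrence(x, y, D)
# Same value from the factorial form (D-x)!(D-y)! / ((D-x-y)! D!)
q_fact = (factorial(D - x) * factorial(D - y)) / (
    factorial(D - x - y) * factorial(D)
)
print(q, q_fact, 1 - q)  # last value is P(TXT > 0)
```

Because the expression is symmetric in x and y, swapping the two frequencies leaves the result unchanged.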



However, the more general condition is when there is at least one
term that is used i_{t_a} times and at least one term that is used
i_{t_b} times; that is, j_x ≥ 1 and j_y ≥ 1 and j_x ≠ j_y.

Let X = the number of documents described by at least one of
        the j_x terms with frequency of use x

    Y = the number of documents described by at least one of
        the j_y terms with frequency of use y

Given X and Y, for those terms with the same frequency of occurrence,
the probability that there are no co-occurrences, Q_{XY}(0), of these
terms is exactly the probability Q_{xy}(0) defined above; that is,

$$Q_{XY}(0) = \frac{(D-X)!\,(D-Y)!}{(D-X-Y)!\,D!}$$

When the specific values of X and Y are not known, the values of P(X)
and P(Y) must be determined. Under these conditions, the probability
that there are no co-occurrences is defined as

$$Q(0) = \sum_{X \ge x} \sum_{Y \ge y} P(X)\,P(Y)\,Q_{XY}(0)$$

where P(X) = probability that X documents are described by those j_x
terms with frequency of use x.

A special case of the above general relationship is the probability
of no co-occurrence among the terms with frequency x, where
for x·j_x ≤ D,

(1) the number of ways the event no-occurrence can occur equals

        [expression illegible in source]

and

(2) the number of possible events in the space D with j_x terms,
    with frequency of use x, mapped onto that space equals

        [expression illegible in source]

Thus the probability of no co-occurrence for j_x terms with the same
frequency of occurrence is:

        [expression illegible in source]                       (III)
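The closed-form expressions for this special case are not legible in the text above. One way to count under the stated random-assignment model is sketched here; it is offered as a reconstruction for illustration, not as the report's own Eq. III. The favorable count picks x documents for each of the j_x terms from those not yet used, and the total count lets every term pick any x of the D documents. The function name is hypothetical.

```python
from math import comb


def prob_none_cooccur(x, j_x, n_docs):
    """Chance that j_x terms, each used x times, share no document."""
    if x * j_x > n_docs:
        return 0.0  # not enough documents to keep the terms apart
    ways = 1
    for k in range(j_x):
        ways *= comb(n_docs - k * x, x)   # documents still unused
    return ways / comb(n_docs, x) ** j_x


# Terms with frequency of use 1 in a 102-document sample (assumed size).
for j in (6, 12, 24):
    print(j, round(1 - prob_none_cooccur(1, j, 102), 3))
```

With D = 102 and x = 1 the probability of at least one co-occurrence comes out close to one half at about 12 terms, which agrees with the behavior described for the theoretical curve.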




Of particular interest are the lower bound conditions or
probabilities that describe the co-occurrence of terms with
f(i)_min = i_{t_a} = 1; that is, terms with frequency of use of one.
This probability can be viewed as the threshold case because, as shown
in previous sections, the co-occurrence of terms i and j with f(i) and
f(j) > 1 is always greater than or equal to the f(i)_min = 1 case.

A plot of the theoretical probability of at least one co-occurrence
for terms with f(i) or x = 1 with varying values of j_x (1 ≤ j_x ≤ D)
is presented in Fig. 5.43. In the range of j_x = 12 it is as
likely to have a co-occurrence as not, for the theoretical
distribution, and for any values of j_x > 12 the likelihood of at
least one co-occurrence is very high. The probability of co-occurrence
for the actual test data is, for the few points computed, greater than
or equal to the theoretical case. As such, Fig. 5.43 affords a
convenient lower bound estimation on the probability of at least one
co-occurrence for terms with frequency of use of 1 as a function of
the number of such terms.

Fig. 5.43. Theoretical probability P = 1 - P(TXT(i,j) = 0) versus
1 ≤ j_x ≤ 80, for f(i) = f(j) = 1. (Abscissa: j_x, the number of
terms with the same frequency of use.)

Operationally, this means for the test sample where
x = f(i)_min = 1, that one is better off assuming that a co-occurrence
exists than not, and at worst the R_q estimate will be off by one in a
few cases.

A sample of the term-term co-occurrence for the test data is
tabulated in Table 5.8. The columns are labeled in terms of the
variables noted in Eq. II.



Table 5.8

TERM-TERM CO-OCCURRENCES BETWEEN TERMS
WITH DIFFERENT FREQUENCY OF USE

     x    j_x     y    j_y    ΣTXT(t_a,t_b)
     1     80     1     80         83
                  2     36         61
                  3     44         68
                  4     39         91
                  5     24         76
                  6     17         ?4
                  7     10         ?7
                  8     13         ?6
                  9      6         18
                 10      4         14
                 11      3         19
                 12      6         47
                 13      4         23
                 14      3          8
                 15      3         18
                 16      1         10
                 17      2         12
                 20      2         15
                 21      2         29
                 22      2         24
                 24      1         15
                 26      2         10
                 27      1          6
                 32      1         10

No terms in the test data were used for f(i) = 18, 19, 23,
25, 28, 29, 30, 31. Entries marked "?" have an illegible leading digit.



5.6 WORD ASSOCIATION COEFFICIENTS 

The relationship between the elements in the TXT matrix and the
prediction function, γ(f(i)·f(j)/D), is based on the assumption that
descriptors are assigned to documents in a binary manner. That is, a
term is or is not assigned as a descriptor, or in other words, the
term assignment weights are 0 and 1.

In many instances, there is a need to elaborate upon an inquiry
so that additional documents can be retrieved. A common technique to
accomplish inquiry expansion is through word association; that is, by
disjunctively incorporating new terms with those terms in the inquiry
with which they are highly correlated/associated. By necessity, these
correlation relationships have non-integer values, and are derived from
the TXT distribution.

In the Institute of Library Research DRS, a coefficient of
association is determined for all co-occurring index terms. For
purposes of processing convenience, only the four highest correlating
terms are retained as association words for the base term. In the event
that an inquiry is to be expanded, a disjunct is formed with the
original term and its four most highly correlated terms. In general,
the associated set of terms will be different for each index term, and
the members of the set of associated terms can be different for any
one term depending on the word association measure used.

It can be shown that the term co-occurrence factor γ can also be
used to estimate word association coefficients. Following Kuhns (81),
the form of a general class of coefficients of association is defined
to be:

$$C_\alpha(i,j) = \frac{\delta(i,j)}{\alpha}$$

where

$$\delta(i,j) = TXT(i,j) - \frac{f(i)\,f(j)}{D}$$

A sample of the set of candidate expressions for α is listed in
Table 5.9. For the derivation and rationale of these forms, and their
application, see Kuhns (81) and Maron, et al. (98), respectively.

As noted earlier,

$$TXT(i,j) = \gamma\left(\frac{f(i)\,f(j)}{D}\right)$$

and substituting into C_α(i,j) yields

$$C'_\alpha(i,j) = \frac{(\gamma - 1)}{\alpha}\left(\frac{f(i)\,f(j)}{D}\right)$$

Therefore, one can estimate the coefficient of association for any
two terms knowing the γ factor for the DRS.
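The substitution can be illustrated numerically. The sketch below uses δ(i,j) = TXT(i,j) - f(i)f(j)/D and its estimate (γ - 1)·f(i)f(j)/D, and divides by α for three measures whose parameters are legible in Table 5.9 as extracted (S, W, R). Reading the S parameter as D/2 is an assumption, as are all function and variable names.

```python
def delta(txt_ij, f_i, f_j, corpus_size):
    """Numerator from observed co-occurrence: TXT(i,j) - f(i) f(j) / D."""
    return txt_ij - f_i * f_j / corpus_size


def delta_estimate(gamma, f_i, f_j, corpus_size):
    """Numerator estimated from the gamma factor: (gamma - 1) f(i) f(j) / D."""
    return (gamma - 1.0) * f_i * f_j / corpus_size


# Parameter alpha per measure (S assumed to be D / 2).
ALPHA = {
    "S": lambda f_i, f_j, d: d / 2,
    "W": lambda f_i, f_j, d: min(f_i, f_j),
    "R": lambda f_i, f_j, d: max(f_i, f_j),
}


def association(measure, delta_value, f_i, f_j, corpus_size):
    """C_alpha(i, j) = delta(i, j) / alpha for the chosen measure."""
    return delta_value / ALPHA[measure](f_i, f_j, corpus_size)


# classification (f = 20) and clump (f = 5), D = 102, observed TXT = 3.
f_i, f_j, D, txt = 20, 5, 102, 3
gamma_actual = txt * D / (f_i * f_j)  # gamma that reproduces TXT exactly
for m in ("S", "W", "R"):
    print(m, association(m, delta_estimate(gamma_actual, f_i, f_j, D), f_i, f_j, D))
```

When γ reproduces TXT exactly the two numerators coincide; with a γ read from a fitted curve the estimate inherits whatever error that γ carries.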

5.7 SYSTEM GROWTH IMPACT ON RETRIEVAL QUANTITY 

All operational DRSs must sustain changes in corpus collection
and content, and thesaurus size, in order to remain useful over time.
However, as the corpus and thesaurus change, particularly in size,
the performance of the DRS also changes; for the same inquiry it is
very possible to get different output sets from a DRS at different
points in time.

Table 5.9

COEFFICIENTS OF ASSOCIATION PARAMETER α (81)

    Symbol   Parameter α         Description of Parameter
    S        D/2                 Measure of the separation or
                                 "distance between the terms"
    G        (illegible)         Measure of the angle between the
                                 vectors representing the terms
    W        Min(f(i), f(j))     Measure of the conditional
                                 probability on weak evidence
    R        Max(f(i), f(j))     Measure of rectangular distance
                                 between the terms
    P        (illegible)         Measure of the proportion overlap
                                 between the terms
    L        (illegible)         Measure of the linear correlation

In order to demonstrate the sensitivity of quantity output to
changes in the system corpus and thesaurus for different search
strategies, an experiment was performed on the ILR DRS over different
stages of its development. The comparative performance of the
test Document Retrieval System between stage 1 and 2 is based upon
a set of common questions and three word association files and
direct match searches.

From the tabulated data in Table 5.10 and the plot of the
coefficient of word association measures in Fig. 5.44, the dynamic
property of the coefficients of association can be seen. In all
cases, the S-measure produced less output as the corpus and thesaurus
increased in size from stage 1 to stage 2. This is a result of
both the measure and the laboratory search routine. That is, the
denominator of the measure is directly proportional to any increase
in corpus size, hence making the measure smaller with increasing
corpus size, as the numerator increases at a much slower rate. The
laboratory search routine employed also contributes to this decrease
in output in that it has a default relevance threshold condition that
ignores any documents that do not have a relevance value to the
query measurable in the first three significant digits. Hence any
document without a relevance measure in the first three significant
digits will not be retrieved.

On the other hand, the measure G provided an increase in output
for all questions from stage 1 to stage 2. The W-measure provided
no increase for two cases, and a slightly larger set for two cases.

It is interesting to note that the intersection of the output
sets (see Table 5.10) is surprisingly small, for the same measure
and same question for the two stages. Clearly, some documents that
the system attributed as being relevant to a question in stage 1 are
not being retrieved in stage 2. The cause of this difference in
content of the output sets is a characteristic of the sensitivity of
the different measures to system growth.

The experiment does show that the change in output performance
with system growth is certainly non-linear (see Fig. 5.44). And,
further, if one ignores the S-measure it can be seen from the G- and
W-measures, and by examination of denominators of some of the other
candidate measures in Table 5.9, that the output set will always be
as large and, in the majority of cases, much larger for the same
question as the system grows.

Fig. 5.44. (Ordinate: quantity of document output.)

Table 5.10
QUANTITY OUTPUT FOR STAGE 1 AND STAGE 2

                                       Cardinal Measure
               Coeff. of         Output Set
    Inquiry    Assoc.          Stage 1   Stage 2   Intersection   Union
               Measure
       1       S                   3         2           1           3
               G                   2        11           2          11
               W                   2         2           1           3
               Direct Match        1         1           1           1
       2       S                   4         2           1           5
               G                  17        32          15          34
               W                   8        14           8          14
               Direct Match        2         2           2           2
       3       S                  16        11           3          24
               G                   1        23           1          23
               W                   2         2           2           2
               Direct Match        2         2           2           2
       4       S                   4         0           0           4
               G                   1        13           0          14
               W                   0         3           0           3
               Direct Match        0         0           0           0



Chapter 6 

CONCLUSION AND SYNTHESIS OF FINDINGS 

6.1 INTRODUCTION 

The purpose of the analyses in the previous chapters is to pro- 
vide a basis for the development of management and design aids for 
DRSs, through the investigation of fundamental relationships between 
the components of DRSs. 

The objective of this chapter is to summarize and synthesize 
those findings and to discuss their implications for DRS management 
and design. 

6.2 GENERAL CONCLUSIONS 

On the basis of the experiments and analysis reported in Chapter 
5, it is concluded that retrieval quantity can be predicted, and that 
the underlying characteristics which permit the estimation have 
potential as DRS management and design aids. 

To briefly review, the findings made are believed to hold for 
a wide range of DRSs, such as: 

    Corpus size:              100 to 50,000
    Thesaurus size:           300 to 13,000
    Term Frequency of Use:    1 to 4,200

They are based on the detailed analysis of a representative sample
DRS from this range, and consist of the following:

(1) The MEZ canonical form of f(r) = K(r+B)^{-α} characterizes the
    term-frequency-of-use versus term rank distribution for a
    wide range of manipulative index DRSs. The parameters K, B
    and α are estimated as a function of corpus size, thesaurus
    size and depth of indexing.

(2) Term-term co-occurrences are not generated by random sampling
    from the thesaurus.

(3) The value of term-term co-occurrences is directly proportional
    to the function of the product of the frequencies of
    use of the terms, and can be predicted by the relationship

$$TXT(i,j) = \gamma\left(\frac{f(i)\,f(j)}{D}\right)$$

    where γ is defined as a function of term frequency of use
    and corpus size.

(4) The Retrieval Quantity of a formal inquiry can be accurately
    predicted as a function of γ, term frequency of use and
    corpus size.

(5) For the class of coefficients of association of the form
    (see Kuhns (81))

$$C_\alpha(i,j) = \frac{\delta(i,j)}{\alpha}$$

    the numerator,

$$\delta(i,j) = TXT(i,j) - \frac{f(i)\,f(j)}{D}$$

    can be estimated by

$$\delta'(i,j) = (\gamma - 1)\left(\frac{f(i)\,f(j)}{D}\right)$$

(6) The probability that two terms, with frequencies of use
    greater than or equal to one, will co-occur is definable by
    an ordered family of curves with an upper and lower bound
    as indicated in Fig. 6.1. Each curve is a function of the
    frequencies of use of the two terms, the number of terms
    with the same frequency of use, and the size of the corpus.

(7) Terms with the same frequency of occurrence have similar
    DRS statistical properties; that is, the distribution of
    the number and value of their term-term co-occurrences are
    approximately the same.

(8) The impact of DRS corpus and thesaurus growth on retrieval
    quantity can be predicted.
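Finding (1) can be made concrete with a few lines of code. The sketch below evaluates the MEZ canonical form f(r) = K(r+B)^(-α) over the first few ranks; the parameter values are purely hypothetical and are not fitted to any corpus in this report.

```python
def mez_frequency(rank, k, b, alpha):
    """Term frequency of use at a given rank: K * (rank + B) ** (-alpha)."""
    return k * (rank + b) ** (-alpha)


# Hypothetical parameters, chosen only to show the shape of the curve.
K, B, ALPHA = 100.0, 2.0, 1.0
curve = [mez_frequency(r, K, B, ALPHA) for r in range(1, 6)]
print([round(v, 2) for v in curve])
```

The curve falls monotonically with rank, and in log-log space it approaches a straight line of slope -α once the rank is large compared with B.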

6.3 MANAGEMENT AND DESIGN AIDS 

The management of a DRS entails cost/benefit analysis of system
operations and plans, measuring system performance for different
tasks, and controlling the system processes. It is not the intent
to delve into a discourse on DRS performance evaluation, but rather
to describe how the findings (summarized above) can be used to aid
in some aspects of DRS management and design.

Fig. 6.1. Theoretical family of curves defining the lower bound
of the probability of co-occurrence of two terms with
f(i) = 1, 1 ≤ f(j) ≤ D, and 1 ≤ j_x ≤ D, j_y = 1. (Abscissa: j_x,
number of terms with the same frequency of use; the curve for
f(j) = D is the theoretical upper bound.)

(1) Tuning Inquiries. By estimating R_q for an initial inquiry,
    the grammatical combinations and/or number of terms can be
    modified to yield different expected R_q's. Through this
    pre-processing exercise the DRS user can adjust inquiries to
    retrieve a more preferred quantity of references. In this
    way the marginal effect on quantity output of adding or
    deleting a term of a certain frequency of use, and creating
    different logical combinations, can be estimated.

    By employing such a "tuning" measure it is quite likely
    that the DRS users will find the system more understandable
    and convenient, and management can reduce the potential
    number of user disappointments in system responses.
(2) Predicting and Monitoring the Impact of System Growth. As
    the system corpus and thesaurus change over time, both the
    quality and quantity of the system output will also change,
    for a constant set of inquiries. The R_q measure can be
    used to estimate the impact of corpus and thesaurus change
    on the system output quantity. The most straightforward
    application is to determine the set of γ factors for an
    operational DRS with a specified corpus and thesaurus size,
    and then, if D is increased, to project a proportionate
    increase in the γ-factor bounds. The new γ's can be used to
    estimate the changes in R_q for a specific inquiry. Using
    the R_q measure in this way provides some insight into the
    dynamic characteristics of DRSs.

    One could also use the R_q measure to estimate the impact on
    output quantity due to changes in the thesaurus with the
    corpus held constant. In this process, the frequency of use
    of the thesaurus terms would be changed, and/or new terms
    added. The bounds of the γ-factor would remain the same,
    but the likely value of γ for high f(i) would change, and
    the technique for estimating the new γ's is directly
    analogous to that used in Section 5.5 to illustrate the
    P(TXT(i,j) > 0) distribution.
(3) Indexing Process Modification. There are various controls
    that can be imposed on the indexing process, and the R_q
    measure can be used to estimate the effect of changes in
    control limits on the quantity output. For example, a manager
    or designer may want to:

    a. Truncate the index term frequency of use distribution
       by specifying f(i)_min and/or f(i)_max limits.
       The impact, on quantity outputs, of changing the
       values of f(i)_min/max can be estimated by computing
       R_q at the different values, for a set of typical
       inquiries.

    b. Limit the minimum or maximum number of terms that
       can be used to describe any one document. An
       interesting condition to investigate is to alter the
       "current" depth of indexing, D_c, lower and upper
       bounds so as to gradually approach a uniform
       distribution in which D_c,min = D_c,max. The sensitivity
       of the quantity output to the rate of change of the
       depth of indexing distribution can be estimated by
       the R_q measure, because the frequency of term use,
       f(i), distribution is indirectly altered and R_q is
       a function of the values of f(i).

    c. Specify a limit or a certain distribution on the
       number of terms that can have the same frequency
       of use, over the term-rank space {1,...,D}. By
       altering the value or distribution, the
       P(TXT(i,j) = v), 0 ≤ v ≤ [f(i), f(j)]_min, and
       P(γ = z), γ_L.B. ≤ z ≤ γ_U.B., probability
       distributions are changed, and consequently the
       quantity output for any one inquiry will also be
       modified. The impact can be estimated by R_q, because
       it is a function of the various γ's related to the
       terms in the inquiries.

(4) Inquiry Processing Effort. Given a specifiable file structure
    and an elapsed time distribution for term lookups, the
    number of iterations involved in the determination of R_q
    can be used to estimate the average amount of time to
    process an inquiry. This information could be used by a DRS
    manager or designer to estimate certain resource
    requirements necessary to satisfy existing or projected user
    demands.

The above exemplary management applications of the Rq measure
can also be viewed in the context of a design process. Combining
these applications with certain canonical expressions, noted in
Chapters 4 and 5, that characterize the fundamental relationships in
DRSs, one can construct a hypothetical sequence of steps which
illustrates their use in the design process. Further, this procedure can
be considered as a basis for a simulation model that would enable a
designer to experiment with different parameter values and variable
limits, prior to the construction of the DRS. The steps envisaged
are as follows:






(1) Selection of Corpus Topic 

a. Analysis of user needs 

b. Selection of the published subject area of interest;
for example, the field of Operations Research.

(2) Identification of Periodical Population and Determination 
of Periodical Productivity Distribution 

a. Determination of the tradeoff between number of periodicals
to be collected versus the percent of the relevant
literature covered, by applying Bradford's Law of Scatter
(88). Kendall (75) has in fact investigated the periodical
productivity distribution for Operations Research
and found that if one collected the five most productive
journals, 33 percent of the new articles (documents)
would be captured; the eighteen most productive journals
would capture 50 percent of the new articles;
and the 67 most productive journals would yield 75 percent
of the new articles, etc.

b. Estimation of the expected growth rate of the literature
in the field, and conversely, the death or deletion rate.
In most cases a simple exponential form as in Fig. 1.5 can
be utilized.

(3) Estimation of the Corpus Size D

a. From the determination of the required number of periodicals
to be collected, an estimate of the initial corpus
size, D, can be made.






(4) Selection of Candidate Term Frequency of Use Distributions

a. The most convenient relationship to employ is the MEZ
canonical form, with the parameters K, B and α determined
as in Section 5.3, that is compatible with a corpus of
size D and selected average depth of indexing (e.g.,
15 terms per document).

(5) Determination of the Probability of Term Co-occurrence

a. As a function of the term frequencies of use (f(i)), the
size of the corpus (D), and the distribution of the number
of terms with the same frequency of use (estimated as
in Section 5.3.2*), the probability of two terms with
frequencies of use f(i), f(j) co-occurring can be determined,
as discussed in Section 5.5.

(6) Derivation of the y-Factors for Rq

a. Based on the information determined in steps 4 and 5, the
y-factor distribution can be derived as shown in Section
5.4.1.

(7) Generate Sample Inquiries 

a. A set of "typical" inquiries, from the point of view of
form (and not content), can be constructed using combinations
of Boolean connectors and terms with various frequencies
of use as specified by the MEZ distribution.



*An alternative approach is to employ the Waring distribution;
see Herdan (64, 65) and Jones (73) for a discussion of this
distribution.






(8) Estimation of Quantity Output, Rq

a. Using the y-factor distribution and the procedure developed
in Section 5.4.1, the quantity output for the candidate
inquiries can be predicted (for a direct match search
strategy).

(9) Measurement of the Sensitivity of Rq to:

a. Changes in the corpus and thesaurus size 

b. Changes in the MEZ parameters 

c. Changes in the distribution of the number of terms with 
the same frequency of use 

d. Changes in search strategy 
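
Step (2)a of the sequence above can be illustrated with the three Operations Research figures attributed to Kendall (75). The following sketch interpolates between those points to estimate how many journals a target coverage would require. The interpolation rule (linear in the logarithm of the journal count) is an assumption motivated by Bradford-type scatter, not a law fitted in this report, and the function name is hypothetical.

```python
import math

# Figures quoted in step (2)a: (most productive journals collected,
# percent of new articles captured).
KENDALL_POINTS = [(5, 33.0), (18, 50.0), (67, 75.0)]

def journals_needed(target_percent, points=KENDALL_POINTS):
    """Estimate the journals required to reach a target coverage.

    Assumption: coverage grows linearly in log(journal count) between
    adjacent tabulated points. No extrapolation is attempted.
    """
    if not points[0][1] <= target_percent <= points[-1][1]:
        raise ValueError("target outside the tabulated range")
    for (n0, p0), (n1, p1) in zip(points, points[1:]):
        if p0 <= target_percent <= p1:
            frac = (target_percent - p0) / (p1 - p0)
            log_n = math.log(n0) + frac * (math.log(n1) - math.log(n0))
            # Small tolerance so exact table values are not rounded up.
            return math.ceil(math.exp(log_n) - 1e-9)
    raise AssertionError("unreachable for sorted points")

for pct in (33, 50, 60, 75):
    print(pct, journals_needed(pct))
```

A designer could combine such an estimate with an assumed number of articles per journal to arrive at the initial corpus size D of step (3).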

The standard process of designing DRSs is considerably more art
than science, with many system variables and relationships at best
indirectly controlled or left to assume "natural" values by implicit
default options. This process can be improved by simply taking
advantage of the statistical regularities that characterize the
relationship among DRS parameters. The hypothetical design sequence
described above is one way in which the design process can be made more
formal and accurate. Also, it provides a basis for a structure within
which a designer can exploit the various canonical forms that
characterize the statistical stability of various DRS properties.








Chapter 7 

RECOMMENDATIONS FOR ADDITIONAL RESEARCH

7.1 INTRODUCTION 

There are a number of directions for future research in the area
of analytic/simulation modeling of Document Storage and Retrieval
Systems. Several suggestions are briefly noted in this chapter in
the hope that they will provide a point of departure for one or more
subsequent research efforts.

7.2 CORPUS HOMOGENEITY AND HETEROGENEITY 

The DRSs investigated in this study are basically homogeneous in 
subject content; that is to say, the corpus is dedicated to a single 
subject. The ILR DRS has a homogeneous corpus and the subject is In- 
formation Science. A measure to distinguish between a homogeneous 
and heterogeneous corpus has yet to be developed. Also, a means of 
measuring the impact of more or less heterogeneity on DRS performance 
is needed. 

Presumably, a measure could be based in part on the characteristics
of the DXD matrix, which is defined by the operation

(DXT)(DXT)^T.

The DXD matrix gives the document-document association profiles, and
presumably in a homogeneous corpus the majority of documents would be
highly associated. The converse would hold for a heterogeneous corpus.
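
To make the suggestion concrete, the sketch below forms the DXD matrix as the product of a binary document-term matrix with its transpose and derives one possible homogeneity indicator from it. The indicator (mean Jaccard overlap of distinct document pairs) is purely an illustrative assumption, since no such measure has yet been developed; the function names and sample matrices are likewise hypothetical.

```python
def dxd_matrix(dxt):
    """Return DXD = (DXT)(DXT)^T for a binary document-by-term matrix."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in dxt]
            for row_i in dxt]

def mean_pairwise_association(dxt):
    """Hypothetical homogeneity indicator.

    Uses DXD entries: dxd[i][j] counts shared terms and dxd[i][i] counts
    the terms of document i. Returns the mean Jaccard overlap over all
    distinct document pairs.
    """
    dxd = dxd_matrix(dxt)
    scores = []
    for i in range(len(dxt)):
        for j in range(i + 1, len(dxt)):
            union = dxd[i][i] + dxd[j][j] - dxd[i][j]
            scores.append(dxd[i][j] / union if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical corpora: rows are documents, columns are index terms.
homogeneous = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
heterogeneous = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
print(mean_pairwise_association(homogeneous))
print(mean_pairwise_association(heterogeneous))
```

Under this assumed indicator the first corpus scores higher than the second, which is the direction of the conjecture stated above.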






7.3 DISTRIBUTION OF TERMS WITH COMMON FREQUENCIES OF USE

Little, if any, control is ever exercised over the number of
terms allowed to have the same frequency of occurrence, Jx. From the
MEZ relationship, the Waring distribution (see Herdan (64, 65)) and
Zipf's two "Laws" (see Booth (10)), there is an implied increase in
Jx as the rank of the term decreases. This simply means that there
will be more terms that are used infrequently than there are terms
that are used frequently. The issue of interest is, what should Jx
be for a specified term rank and for certain system characteristics,
D and T, and what is the impact of Jx on DRS performance?

It is clear that Jx has a marked impact on the probability of
co-occurrence of terms with frequencies of use f(i), f(j). This is
illustrated in Fig. 5.43, in which the theoretical lower bound of the
actual P(TXT(i,j)/f(i),f(j) > 0) is plotted for f(i) = f(j) = 1 and
1 ≤ Jx ≤ D. The various formulae presented in Sec. 5.5 provide a
point of departure for any additional computations of P(TXT(i,j) =
S) for a specific f(i), f(j) and Jx.
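
As an illustration of the quantities discussed in this section, the sketch below tabulates Jx from a list of term frequencies and computes a co-occurrence probability under a deliberately simple model. The model (each term assigned to its documents uniformly at random and independently of other terms) is an assumption introduced here for illustration; it is not the derivation of Sec. 5.5 and it ignores the influence of Jx. Function names and sample data are hypothetical.

```python
from collections import Counter
from math import comb

def terms_per_frequency(term_frequencies):
    """Return {x: Jx}, the number of terms sharing each frequency of use x."""
    return dict(Counter(term_frequencies))

def cooccurrence_probability(D, f_i, f_j):
    """P(two terms share at least one document) under random assignment.

    Assumption: term i is assigned to f_i of the D documents and term j
    to f_j of them, uniformly and independently.
    """
    if not (0 <= f_i <= D and 0 <= f_j <= D):
        raise ValueError("frequencies must lie between 0 and D")
    # Complement of the chance that term j avoids every document of term i.
    return 1.0 - comb(D - f_i, f_j) / comb(D, f_j)

frequencies = [40, 12, 12, 5, 3, 3, 3, 1, 1, 1, 1, 1]  # hypothetical f(i)
print(terms_per_frequency(frequencies))
print(cooccurrence_probability(100, 1, 1))
```

In the hypothetical sample, the lowest frequency of use is shared by the largest number of terms, which is the pattern the MEZ and Waring forms imply.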

7.4 THE MEZ CANONICAL FORM 

Mandelbrot (94, 95, 96), Herdan (64, 65), Zipf (153), and
Krevitt (80) have investigated various term usage relationships,
primarily in a text-free setting. For that unconstrained setting, the
MEZ exponent α is considered always to be in the range of 1 ≤ α ≤
1.6. However, the system vocabularies of DRSs are very constrained (in
the predicate calculus sense), and for the test system a very good
fit between the MEZ and the term frequency of use versus rank curve
was possible with α = 0.9. Clearly, if one were to reduce α to zero,
the frequency of use versus rank distribution would yield a uniform
distribution. Intuitively, then, as one reduces α one constrains the
"richness" of the vocabulary. Notably, Mandelbrot (94) has observed
that in children's talk (an example of constrained vocabularies of a
different type) it is possible for α ≤ 1. The issues of interest
are: What should α be in order that the DRS perform well, and how
can one best adjust the DRS to move toward a more preferred term
frequency of use situation? And, as the DRSs grow over time, what
changes can be expected in the parameters K, B and α?
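
The effect of the exponent can be seen numerically with a small sketch. It assumes the MEZ form is the familiar Mandelbrot expression f(r) = K / (r + B)^α; the exact canonical form of Section 5.3 is not reproduced in this chapter, so that expression, the function name, and the parameter values below are all assumptions for illustration.

```python
def mez_frequency(rank, K, B, alpha):
    """Assumed MEZ form: frequency of use of the term at a given rank,
    f(r) = K / (r + B) ** alpha, with rank starting at 1."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    return K / (rank + B) ** alpha

# Hypothetical parameters; compare a free-text-like exponent, the value
# reported for the test system, and the uniform limit alpha = 0.
for alpha in (1.2, 0.9, 0.0):
    curve = [round(mez_frequency(r, 1000.0, 1.0, alpha), 1) for r in (1, 10, 100)]
    print(alpha, curve)
```

With α = 0 every rank receives the same frequency K, which is the uniform limit noted in the text; larger α concentrates use on the top-ranked terms.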

7.5 DEPTH OF INDEXING DISTRIBUTION 

The depth of indexing distribution portrays the frequency distribution
of the assignment of terms to documents. Of the systems*
on which empirical data were available, the basic form of the
distributions is very similar; in fact, sufficiently similar for one to
suspect that a canonical form should exist. On the basis of a crude
fit, the Beta distribution is a candidate:

f(w; x, \phi) = \frac{(x + \phi + 1)!}{x!\,\phi!}\, w^{x} (1 - w)^{\phi}

where w is the normalized depth of indexing level defined over the
finite interval 0 < w < 1, and x and φ are constants. Wiederkehr
(143) has developed certain forms for a modified Beta distribution in
his discussion of search characteristic curves. Also, Bourne (13),
Svenonius (127), Swanson (128) and Zunde (151) have explored various
aspects of the depth of indexing distribution. However, no general
formulation of the expected or likely depth of indexing distribution
has been developed, and just as importantly there is no established
means of linking the depth of indexing characteristics with the term
frequency of use distribution, and the DRS performance.

*The ILR test system and the systems investigated by Litofsky
(90).
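
A minimal sketch of evaluating a Beta density over the normalized depth of indexing follows. It assumes the parameterization with exponents x and φ on w and (1 - w), normalized by Γ(x + φ + 2) / (Γ(x + 1) Γ(φ + 1)), so that the density integrates to one; the function name and parameter values are hypothetical, and no fit to the ILR data is implied.

```python
import math

def beta_density(w, x, phi):
    """Beta density on 0 < w < 1 with exponents x and phi (assumed form).

    The gamma-function ratio generalizes (x + phi + 1)! / (x! phi!) to
    non-integer constants.
    """
    if not 0.0 < w < 1.0:
        raise ValueError("w must lie strictly between 0 and 1")
    norm = math.gamma(x + phi + 2) / (math.gamma(x + 1) * math.gamma(phi + 1))
    return norm * w ** x * (1.0 - w) ** phi

# Hypothetical constants: most documents receive a moderate number of
# terms relative to the maximum depth.
for w in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(w, round(beta_density(w, 2.0, 3.0), 4))
```

Multiplying w by the maximum permitted depth converts the normalized level back to a number of terms per document.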

7.6 HIGHER ORDER TERM ASSOCIATIONS 

The vast majority of discussions (this paper included) dealing
with term-term associations just employ the first order TXT matrix
relationships. As noted in Chapters 4 and 5, the elements TXT(i,j)
provide the degree of association between terms i and j, which is
also the first order of association. To obtain the higher order
associations between two terms, one merely takes the appropriate
power of the TXT matrix. That is, (TXT)^n yields the nth order
association between the terms in the thesaurus. Salton (117) has
suggested a scheme to utilize the higher order associations for expanding
an initial inquiry. The procedure entails a weighting factor α,
where 0 < α < 1, which causes α^n to be a monotonically decreasing
function as n increases. This condition implicitly states that the lower
order associations are more important than the higher order
associations. Employing Salton's notion of a normalized query vector, Q,
one then gets the following relationship between an expanded query
Q', and the original query Q:

Q' = Q\left[ I + \{\alpha(TXT)\}^{1} + \{\alpha(TXT)\}^{2} + \cdots + \{\alpha(TXT)\}^{n} \right].

Given that this type of relationship is valid, what are the reasonable
values of α and n, and what are their effects on the performance
of the DRS?
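
The expansion can be sketched in a few lines, taking the expanded query as Q multiplied by the sum of the identity and the first n powers of α(TXT). The association matrix, query vector, and parameter values below are hypothetical, and the helper names are not taken from Salton (117).

```python
def vec_mat(vector, matrix):
    """Row vector times matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def expand_query(q, txt, alpha, n):
    """Return Q' = Q [I + (alpha TXT) + (alpha TXT)^2 + ... + (alpha TXT)^n]."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    scaled = [[alpha * value for value in row] for row in txt]
    total = [float(v) for v in q]       # the Q I term
    term = [float(v) for v in q]
    for _ in range(n):
        term = vec_mat(term, scaled)    # Q (alpha TXT)^k for k = 1..n
        total = [t + s for t, s in zip(total, term)]
    return total

# Hypothetical three-term thesaurus: term 2 is associated with terms 1 and 3.
txt = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
print(expand_query([1, 0, 0], txt, alpha=0.5, n=2))
```

In this toy case the third term enters the expanded query only through the second order association, and with a smaller weight than the first order neighbor, which reflects the intended down-weighting of higher orders.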

7.7 Rq MODEL EXTENSIONS

Given the basic construct of the Rq model, it is of interest
to consider how the model can be extended to deal in some way with
the issue of relevance.

The most logical step is to employ some means of ranking the
documents by degree of inquiry term/document descriptor overlap or
associative thresholds, or by the weak ordering action suggested by
Cooper (35). The important procedure is to link the output set
with a relevance measure, which in this case would be system defined
(as opposed to user judgment). Obviously, the simplest case is for
a direct match search strategy in which the documents retrieved that
satisfy any explicit or implied conjunction combination of terms in
the inquiry would be judged the most likely relevant subset, and the
documents generated by the disjunctive arguments in the inquiry less
likely to be relevant. The analogous argument would hold for a word
association search strategy. This elementary ranking of the output
set would yield at best a binary relevance mapping on Rq, which is
less discriminating than desired.
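
A minimal sketch of ranking by inquiry term/document descriptor overlap is given below. The scoring rule (a simple count of shared terms), the function name, and the sample descriptor sets are assumptions for illustration; the sketch does not implement Cooper's weak ordering (35) or any Boolean inquiry parsing.

```python
def rank_by_overlap(inquiry_terms, documents):
    """Rank documents by the number of inquiry terms among their descriptors.

    `documents` maps a document identifier to its descriptor list.
    Documents sharing no term with the inquiry are omitted; ties are
    broken by identifier so the ordering is deterministic.
    """
    inquiry = set(inquiry_terms)
    scored = []
    for doc_id, descriptors in documents.items():
        overlap = len(inquiry & set(descriptors))
        if overlap > 0:
            scored.append((overlap, doc_id))
    return [doc_id for overlap, doc_id in
            sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

# Hypothetical descriptor sets.
documents = {
    "d1": ["indexing", "retrieval", "thesaurus"],
    "d2": ["retrieval"],
    "d3": ["growth"],
}
print(rank_by_overlap(["indexing", "retrieval"], documents))
```

Documents matching every inquiry term correspond to the conjunctive subset and appear first; those matching only some terms correspond to the disjunctive remainder.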

A more sophisticated approach would be to employ a probabilistic
mechanism in the DXT matrix that would reflect both the fundamental
indefiniteness* in the indexing term selection process, and the
strength of the term-document assignment. Thus, given a term-document
relevance "weighting," one could introduce relevance thresholds in the
Rq iterative procedure, and potentially rank the output set. The
probabilistic structures put forth by Maron and Kuhns (97) and Bryant
(23) appear to be most appropriate.
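
The idea of a relevance threshold on weighted term-document assignments can be sketched as follows. The scoring rule (mean weight of the inquiry terms, with absent terms counted as zero) is a hypothetical stand-in chosen for brevity; it is not the probabilistic indexing model of Maron and Kuhns (97), and the names and weights are invented for the example.

```python
def rank_by_weight(inquiry_terms, weighted_dxt, threshold):
    """Rank documents whose mean inquiry-term weight meets a threshold.

    `weighted_dxt` maps a document identifier to {term: weight}, with
    weights assumed to lie between 0 and 1.
    """
    if not inquiry_terms:
        raise ValueError("inquiry must contain at least one term")
    results = []
    for doc_id, weights in weighted_dxt.items():
        score = sum(weights.get(t, 0.0) for t in inquiry_terms) / len(inquiry_terms)
        if score >= threshold:
            results.append((score, doc_id))
    ordered = sorted(results, key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in ordered]

# Hypothetical weighted assignments.
weighted_dxt = {
    "d1": {"indexing": 0.9, "retrieval": 0.5},
    "d2": {"retrieval": 0.8},
    "d3": {"indexing": 0.2},
}
for doc_id, score in rank_by_weight(["indexing", "retrieval"], weighted_dxt, 0.3):
    print(doc_id, round(score, 2))
```

Raising the threshold shrinks the output set, so under these assumptions the threshold acts as a further control on quantity output alongside the inquiry form.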

7.7.1 Psychological Analogies 

A rather innovative extension of the model structure is to attempt
to characterize the conceptual "dual" or analogous psychological
process experienced by humans in searching for or processing
information, by a similar model construction. That is to say, there
are certain regularities that characterize Document Retrieval Systems,
and it is of interest to know whether there are analogous regularities
that characterize the human thought process of information storage
and retrieval, and, in particular, indexing and abstracting processes.

There appears to be a sound, though largely unexploited, logical
basis upon which to investigate the above notion. For example, the
MEZ relationship is known to characterize the word frequency of
occurrence and rank distribution of a variety of languages. In fact,
Mandelbrot (94, 95, 96) (see also Brillouin (18)) derived that
relationship employing the notion of the "cost" of a word as the
indicator of its likelihood of use. The hypothesis is that the less costly
words are used more often than the more costly, where cost is a
surrogate for "effort" to use. Also, Zipf (153) presented the "law" of
term frequency of use versus rank within the context of his theory on
Human Behavior and the Principle of Least Effort (153). An attempt
was made by Rosenberg (115) to utilize the Zipf relationship for
predicting index term selection for use, but the performance of that
model clearly needs to be improved before an operational construct
can be developed. It would seem that a weighted Bayesian or
conditioned probability structure is needed to accommodate the many
degrees of semantic uncertainty and noise embedded in document
discussions, human communication and indexing.

*This indefiniteness arises more from a type of intrinsic
uncertainty or ambiguity than from statistical variation; it is a sort of
"fuzzy" membership of a term to a document descriptor set (see Zadeh
(151) for a fuller discussion).

**Suggested by Professor F. N. Nicosia, Graduate School of Business
Administration, University of California, Berkeley.






BIBLIOGRAPHY 



1. A. D. Little, Inc., Centralization and Documentation, July 1963.

2. ________, Appendices to Centralization and Documentation,
     July 1963.

3. Artandi, S., An Introduction to Computers in Information Science,
     The Scarecrow Press, Inc., Metuchen, New Jersey, 1968.

4. Baker, N. R., and R. E. Nance, "Organizational Analysis and
     Simulation Studies of University Libraries: A Methodological
     Overview," Information Storage and Retrieval, Vol. 5, 1970.

5. Barhydt, G. C., "A Comparison of Relevance Assessment by Three
     Types of Evaluations," American Documentation Institute:
     Parameters of Information Science, Spartan Books, Washington,
     D. C., 1964.

6. Becker, J., and R. Hayes, Information Storage and Retrieval: Tools,
     Elements, Theories, John Wiley, New York, 1963.

7. Bernier, C. L., "Correlative Indexes versus the Blank Sort,"
     American Documentation, Vol. 9, 1958.

8. Beyer, W. H. (ed.), Handbook of Tables for Probability and
     Statistics, The Chemical Rubber Company, 1966.

9. Birkhoff, G., Lattice Theory, American Mathematical Society,
     Providence, Rhode Island, 1966.

10. Booth, A. D., "A 'Law' of Occurrences for Words of Low Frequency,"
     Information and Control, Vol. 10, No. 4, April 1967.

11. Borko, H., Evaluating the Effectiveness of Information Retrieval
     Systems, System Development Corporation, SP 909, August 1962.

12. Bourne, C., "The World's Technical Journal Literature: An Estimate
     of Volume, Origins, Language, Field Indexing, and Abstracting,"
     American Documentation, April 1962.

13. ________, Methods of Information Handling, John Wiley, New
     York, 1966.

14. ________, "Evaluation of Indexing Systems," Annual Review of
     Information Science, Vol. 1, Interscience, New York, 1967.

15. ________, et al., Requirements, Criteria, and Measures of
     Performance of Information Storage and Retrieval Systems,
     Stanford Research Institute, December 1961.

16. Bradford, S. C., Documentation, Crosby Lockwood, London, 1948.

17. Brandhorst, W. T., "Simulation of Boolean Logic Constraints
     Through the Use of Term Weights," American Documentation, Vol.
     17, No. 3, 1966.

18. Brillouin, L., Science and Information Theory, Academic Press,
     New York, 1962.






19. Brookes, B. C., "The Growth, Utility, and Obsolescence of Scientific
     Periodical Literature," Journal of Documentation, Vol. 26, No. 4,
     December 1970.

20. ________, "The Derivation and Application of the Bradford-
     Zipf Distribution," Journal of Documentation, Vol. 24, No. 4,
     December 1968.

21. ________, "Obsolescence of Special Library Periodicals:
     Sampling Errors and Utility Contours," Journal of ASIS,
     September/October 1970.

22. ________, "The Design of Cost Effective Hierarchical
     Information Systems," Information Storage and Retrieval, Vol. 6,
     1970.

23. Bryant, E. C. (ed.), Evaluation of Document Retrieval Systems:
     Literature, Perspective, Measurement, Technical Papers, Westat
     Research, Inc., December 31, 1968.

24. ________, et al., Associative Adjustments to Reduce Errors
     in Document Screening, Westat Research, Inc., 1967.

25. ________, "Modeling in Document Handling," Electronic Handling
     of Information, Kent and Taulbee (eds.), Academic Press, London,
     1967.

26. Bush, V., "Memex Revisited," Science is Not Enough, New York, 1967.

27. ________, "As We May Think," Atlantic Monthly, Vol. 176, July 1945.

28. Carter, L. F., et al., National Document Handling Systems for
     Science and Technology, John Wiley, New York, 1967.

29. Churchman, C. W., The Systems Approach, Delacorte Press, New York,
     1968.

30. Cleverdon, C. W., F. W. Lancaster, and J. Mills, "Uncovering Some
     Facts of Life in Information Retrieval," Special Libraries,
     Vol. 55, February 1964.

31. ________, Report on the Testing and Analysis of an
     Investigation into the Comparative Efficiency of Indexing Systems,
     The College of Aeronautics, Cranfield, England, October 1962.

32. ________, Factors Determining the Performance of Indexing
     Systems, Volume 1 and 2, The College of Aeronautics, Cranfield,
     England, 1966.

33. Cooper, W. S., "A Definition of Relevance for Information
     Retrieval," Information Storage and Retrieval, Vol. 7, No. 1,
     June 1971.

34. ________, "On Deriving Design Equations for Information
     Retrieval Systems," Journal of ASIS, November/December 1970.

35. ________, "Expected Search Length: A Single Measure of
     Retrieval Effectiveness Based on the Weak Ordering Action of
     Retrieval Systems," American Documentation, Vol. XIX, January 1968.

36. Cuadra, C., and R. Katter, "Opening the Black Box of Relevance,"
     Journal of Documentation, Vol. 23, 1967.






37. Cuadra, C. A., On the Utility of the Relevance Concept, System
     Development Corporation, SP-1595, Santa Monica, California, March
     1964.

38. Cuadra, C., and R. Katter, et al., Experimental Studies of
     Relevance Judgment, System Development Corporation, Santa Monica,
     California, 1967.

39. Curry, H. B., R. Feys, and W. Craig, Combinatory Logic, Vol. 1,
     North Holland Publishing Company, Amsterdam, 1968.

40. DeLuca, A., and S. Termini, "A Definition of a Nonprobabilistic
     Entropy in the Setting of Fuzzy Sets Theory," Information and
     Control, Vol. 20, 1972.

41. De Solla Price, D. J., "Nations Can Publish or Perish," Science and
     Technology, October 1967.

42. ________, Little Science, Big Science, Columbia University
     Press, New York, 1963.

43. ________, Science Since Babylon, Yale University Press,
     New Haven, Connecticut, 1961.

44. Doyle, L. B., "Indexing and Abstracting by Association," American
     Documentation, Vol. 13, 1962.

45. ________, Is Relevancy an Adequate Criterion in Retrieval
     System Evaluation, System Development Corporation, SP-1262,
     Santa Monica, California, July 1963.

46. Dym, E. D., "Relevance Predictability: Investigation into
     Background and Procedure," Electronic Handling of Information, Kent
     and Taulbee (eds.), Academic Press, London, England, 1967.

47. Fairthorne, R. A., "Basic Parameters of Retrieval Tests," American
     Documentation Institute: Parameters of Information Science,
     Spartan Books, Washington, D. C., 1964.

48. ________, Towards Information Retrieval, Butterworths,
     London, England, 1961.

49. ________, "Progress in Documentation," Journal of
     Documentation, Vol. 25, No. 4, December 1969.

50. Feller, W., Introduction to Probability Theory and Its Application,
     Vols. 1 and 2, John Wiley, New York, 1950.

51. Fisher, G. H., Cost Considerations in Systems Analysis, American
     Elsevier, New York, 1971.

52. Foskett, A. C., The Subject Approach to Information, Archon Books,
     Hamden, Connecticut, 1969.

53. Gazale, M. J., "Irredundant Disjunctive and Conjunctive Forms of a
     Boolean Function," IBM Journal of Research and Development, Vol. 1,
     1957.

54. Giuliano, V., "The Interpretation of Word Association," Symposium
     on Statistical Association Methods for Mechanized Documentation,
     March 17-19, 1964.




55. Giuliano, V. E., and P. E. Jones, Study and Test of a Methodology
     for Laboratory Evaluation of Message Retrieval Systems, CFSTI, 1966.

56. Goffman, W., and V. Newill, "Methodology for Test and Evaluation
     of Information Retrieval Systems," Information Storage and
     Retrieval, Vol. 3, 1966.

57. Goffman, W., and K. Warren, "Dispersion of Papers Among Journals
     Based on a Mathematical Analysis of Two Diverse Medical
     Literatures," Nature, March 29, 1969.

58. Good, I. J., "The Decision Theory Approach to the Evaluation of
     Information Retrieval Systems," Information Storage and Retrieval,
     Vol. 3, August 1966.

59. Gottschalk, C. F., and D. Desmond, "Worldwide Census of Scientific
     and Technical Serials," American Documentation, July 1963.

60. Groos, O. V., "Bradford's Law and the Keenan-Atherton Data,"
     Journal of American Documentation, Vol. 19, No. 1, 1967.

61. Gull, C. D., "Seven Years of Work on the Organization of Materials
     in the Special Library," American Documentation, Vol. 7, October
     1956.

62. Harlow, J., and P. Abrahams, An Investigation of the Techniques and
     Concepts of Information Retrieval, Final Report, Signal Corps,
     July 31, 1964.

63. Hayes, R. M., "Mathematical Models in Information Retrieval,"
     Natural Language and the Computer, Garvin (ed.), McGraw Hill,
     New York, 1963.

64. Herdan, G., The Advanced Theory of Language as Choice and Chance,
     Springer-Verlag, New York, 1966.

65. ________, Quantitative Linguistics, Butterworths, London, England,
     1964.

66. Hertz, D. B., Research Study of Criteria and Procedures for
     Evaluating Scientific Information Retrieval Systems, Arthur
     Andersen and Co., March 1962.

67. Holt, C., and W. Schrank, "Growth of the Professional Literature
     in Economics and Other Fields, and Some Implications," American
     Documentation, January 1968.

68. Houston, N., and E. Wall, "The Distribution of Term Usage in
     Manipulative Indexes," American Documentation, April 1964.

69. IFD, The ASLIB Cranfield Research Project Report A 7A8, 1968.

70. Iker, H. P., "Solution of Boolean Equations Through the Use of
     Term Weights to the Base Two," American Documentation, January
     1967.

71. Jahoda, G., Information Storage and Retrieval Systems for Individual
     Researchers, Wiley-Interscience, New York, 1970.









72. Jones, P. E., and R. M. Curtice, "A Framework for Comparing Term
     Association Measures," American Documentation, July 1963.

73. Jones, P. E., et al., Papers on Automatic Language Processing:
     Selected Collection Statistics and Data Analysis, A. D. Little,
     February 1967.

74. Katter, R. V., "Design and Evaluation of Information Systems,"
     Annual Review of Information Science and Technology, Vol. 4,
     C. Cuadra (ed.), Encyclopedia Britannica Inc., Chicago, 1969.

75. Kendall, M. G., "The Bibliography of Operations Research,"
     Operational Research Quarterly, Vol. 11, No. 1/2.

76. Kessler, M. M., Technical Information Flow Patterns, Lincoln
     Laboratories, MIT.

77. King, D. W., Evaluation During File Development of the Glass
     Technology Coordinate Index, U.S. Department of Commerce, Patent
     Office, November i~^7.

78. Kochen, M., "System Technology for Information Retrieval," The
     Growth of Knowledge: Readings on Organization and Retrieval of
     Information, John Wiley & Sons, New York, 1967.

79. Krauze, T. K., and C. Hillinger, "Citations, References and the
     Growth of Scientific Literature: A Model of Dynamic Interaction,"
     Journal of ASIS, September/October 1971.

80. Krevitt, B., and B. C. Griffith, "A Comparison of Several Zipf-
     Type Distributions in Their Goodness of Fit to Language Data,"
     Journal of ASIS, May-June 1972.

81. Kuhns, J. L., "The Continuum of Coefficients of Association,"
     Statistical Association Methods for Mechanized Documentation,
     Symposium Proceedings, Spartan Books, Washington, D. C., 1964.

82. Lancaster, F. W., Evaluation of the Medlars Demand Search Service,
     National Library of Medicine, 1968.

83. ________, Information Retrieval Systems: Characteristics,
     Testing and Evaluation, John Wiley, New York, 1968.

84. Lancaster, F. W., and W. D. Climenson, "Evaluating the Econometric
     Efficiency of a Document Retrieval System," Journal of
     Documentation, Vol. 24, No. 1, 1968.

85. ________, "The Cost-Effectiveness Analysis of Information
     Retrieval and Dissemination Systems," Journal of ASIS, January-
     February 1971.

86. Lesk, M., and G. Salton, "Interactive Search and Retrieval Methods
     Using Automatic Information Displays," Proceedings of the Spring
     Joint Computer Conference, 1969.

87. ________, "Word Association in Document Retrieval Systems," American
     Documentation, January 1969.

88. Leimkuhler, F. F., "The Bradford Distribution," Journal of
     Documentation, Vol. 23, No. 3, September 1967.






89. Line, M. B., "The Half-Life of Periodical Literature: Apparent
     and Real Obsolescence," Journal of Documentation, Vol. 26, March
     1970.

90. Litofsky, B., Utility of Automatic Classification Systems for
     Information Storage and Retrieval, Ph.D. Thesis, University of
     Pennsylvania, May 1969.

91. Long, J. M., et al., "Dictionary Buildup and Stability of Word
     Frequency in a Specialized Medical Area," American Documentation,
     January 1967.

92. Lowe, T. C., Design Principles for an On-Line Information Retrieval
     System, Moore School Report No. 67-14, The Moore School of
     Electrical Engineering, Philadelphia, Pennsylvania, 1966.

93. McLuhan, M., The Gutenberg Galaxy, University of Toronto Press,
     Canada, 1962.

94. Mandelbrot, B., "An Informational Theory of the Statistical Structure
     of Language," Communication Theory, Willis Jackson (ed.), Academic
     Press, New York, 1953.

95. ________, "Simple Games of Strategy Occurring in Communication
     Through Natural Languages," Transactions of the I.R.E., Professional
     Group on Information Theory, Vol. 3, March 1954.

96. ________, "On the Theory of Word Frequencies and on Related
     Markovian Models of Discourse," Proceedings of the Twelfth
     Symposium in Applied Mathematics, R. Jakobson (ed.), American
     Mathematical Society, Rhode Island, 1961.

97. Maron, M. E., and J. L. Kuhns, "On Relevance, Probabilistic Indexing
     and Information Retrieval," Journal of ACM, Vol. 7, No. 3, 1960.

98. Maron, M. E., A. Humphry, and J. Meredith, An Information Processing
     Laboratory for Education and Research in Library Science: Phase I,
     Institute of Library Research, University of California, Berkeley,
     June 1969.

99. Marron, H., "On Costing Information Services," Proceedings of the
     American Society for Information Science, 1969.

100. Martyn, J., and B. C. Vickery, "The Complexity of the Modelling of
     Information Systems," Journal of Documentation, Vol. 26, No. 3,
     September 1970.

101. Meadow, C. T., The Analysis of Information Systems: A Programmer's
     Introduction to Information Retrieval, John Wiley, New York, 1967.

102. Mood, A. M., and F. A. Graybill, Introduction to the Theory of
     Statistics, McGraw Hill Book Company, New York, 1963.

103. Mott, T. H., Jr., "Determination of the Irredundant Normal Forms
     of a Truth Function by Iterated Consensus of the Prime Implicants,"
     I.R.E. Transactions on Electronic Computers, Vol. EC-9, 1960.






104. Oettinger, A. G., "An Essay in Information Retrieval or the Birth
     of a Myth," Information and Control, Vol. 8, 1965.

105. Overmeyer, L., "An Analysis of Output Costs and Procedures for an
     Operational Searching System," American Documentation, Vol. 14,
     1963.

106. Perry, J. W., A. Kent, and M. M. Berry, Machine Literature Searching,
     Interscience, New York, 1956.

107. "Proceedings of the American Documentation Institute," Annual
     Meeting, Vol. 4, New York, October 22-27, 1967.

108. Quine, W. V., "A Way to Simplify Truth Functions," American
     Mathematical Monthly, Vol. 62, 1955.

109. ________, "The Problem of Simplifying Truth Functions," American
     Mathematical Monthly, Vol. 59, 1952.

110. Ranganathan, S. R., The Colon Classification, Vol. IV, Rutgers Series
     on Systems for the Intellectual Organization of Information, S.
     Artandi (ed.), Rutgers University, 1965.

111. Raver, N., "Performance of Information Retrieval Systems," Information
     Retrieval, George Schecter (ed.), Thompson Book Company, Washington,
     D. C., 1967.

112. Rees, A. M., "Evaluation of Information Systems and Services,"
     Annual Review of Information Science and Technology, Vol. 2,
     C. Cuadra (ed.), Interscience, New York, 1968.

113. ________, The Evaluation of Retrieval Systems, Comparative Systems
     Laboratory, TR-5, Western Reserve University, Cleveland, Ohio,
     July 1965.

114. Robertson, S. E., "The Parametric Description of Retrieval Tests,
     Part II," Journal of Documentation, Vol. 23, No. 2, June 1969.

115. Rosenberg, V., "A Study of Statistical Measures for Predicting
     Index Term Usage," Journal of ASIS, February 1971.

116. Ruspini, E. H., "A New Approach to Clustering," Information and
     Control, Vol. 15, 1969.

117. Salton, G., Automatic Information Organization and Retrieval, McGraw
     Hill, New York, 1968.

118. ________, "The Evaluation of Automatic Retrieval Procedures:
     Selected Results Using the SMART System," American Documentation,
     Vol. 16, No. 3.

119. ________, The SMART Project - Status Report and Plans: Reports
     on Evaluation, Clusters, and Feedback, Scientific Report No.
     ISR-12, Cornell University, New York, 1967.

120. Saracevic, T., "Selected Results from an Inquiry into Testing of
     Information Retrieval Systems," Journal of ASIS, March-April 1971.

121. Schultz, C., C. Schwartz, and L. Steinberg, "A Comparison of
     Dictionary Use Within Two Information Retrieval Systems,"
     American Documentation, October 1961.






122. Schultz, L. (ed.). The Information Bazaar; Sixth Annual National 

CoHoquiuni of Information Retrieval > Mechanical Documentation 
Service, 1969. 

123. Sharp, J. R., Some Fundamentals of Information Retrieval , London 

House & Maxwell, New York, 1965. 

124. Shumway, R. H., "Some Estimation Problems Associated with Evaluating 

Information Retrieval Systems," Evaluation of Document Retrieval 
Systems , E. C. Bryant (ed.), Westat Research, Inc., December 31, 
1968. 

125. Soergel, D., "Mathematical Analysis of Documentation Systems," 

Information Storage and Retrieval Systems. Vol. 3, 1967. 

126. Snyder, M. B., et. a1., Methodology for Test and Evaluation of 

Document Retrieval Systems , Human Sciences Research, tttc.,- 
Maclean, Virginia, January 1966. 

127. Svenonius, E., "An Experiment in Index Term Frequency," Journal of 

ASIS, March-April 1972 ^ 

128. Swanson, D. R., "On Indexing Depth and Retrieval Effectiveness,"
Information System Sciences: Proceedings of the Second Congress,
Hot Springs, Virginia, November 22-25, 1964, J. Spiegal and D. E.
Walker (eds.), Spartan Books, Washington, D. C., 1965.

129. ________, "Searching Natural Language Text by Computer,"
Science, Vol. 132, October 21, 1960.

130. ________, "The Evidence Underlying the Cranfield Results,"
The Library Quarterly, Vol. XXXV, 1965.

131. Swets, J. A., "Information Retrieval Systems," Science, Vol. 141,
July 19, 1963.

132. ________, Effectiveness of Information Retrieval Methods, Bolt,
Beranek, and Newman, June 1967.

133. Switzer, P., "Vector Images in Document Retrieval," Symposium on
Statistical Association Methods for Mechanical Documentation,
March 17-19, 1964.

134. Szasz, G., Introduction to Lattice Theory, Academic Press, New
York, 1965.

135. Taube, M., et al., Studies in Coordinate Indexing, Vols. 1-4,
Documentation, Inc., Bethesda, Maryland, 1953-57.

136. ________, and L. Heilprin, The Relation of the Size of the Question
to the Work Accomplished by a Storage and Retrieval System,
Documentation, Inc., Washington, D. C., August 1957.

137. Tell, B. V., "Auditing Procedures for Information Retrieval Systems,"
Proceedings of the 1965 Congress International Federation for
Documentation, 31st Annual Meeting, October 7-16, 1965, Spartan
Books, Washington, D. C., 1966.



138. Thome, R. G., "The Efficiency of Subject Catalogues and the Cost 
of Information Searches," Journal of Documentation, Vol. 11,
September 1955. 






139. Tinker, J. F., "Imprecision in Meaning Measured by Inconsistency
in Indexing," American Documentation, April 1966.

140. Vickery, B. C., On Retrieval System Theory, Butterworths, London,
England, 1961.

141. ________, "Bradford's Law of Scattering," Journal of
Documentation, Vol. 4, No. 3, 1948.

142. Voigt, M. J., "The Researcher and His Sources of Scientific
Information," Libri, Vol. 9, 1959.

143. Wall, E., "Further Implications of the Distribution of Index Term
Usage," Proceedings of the American Documentation Institute, Vol.
1, Parameters of Information Science, Annual Meeting, Philadelphia,
Pennsylvania, October 5-8, 1964.

144. ________, "Indexing Control," TICA Conference, Drexel Institute
of Technology, June 15-17, 1964, Spartan Books, Washington,
D. C., 1964.

145. Weaver, W., "Recent Contributions to the Mathematical Theory of
Communication," The Mathematical Theory of Communications, C. E.
Shannon and W. Weaver (eds.), University of Illinois Press,
Urbana, Illinois, 1949.

146. Webster, R., "A Note on Dictionary Searching," Information Storage
and Retrieval, Vol. 5, 1969.

147. Westat Research, Inc., Procedural Guide for the Evaluation of
Document Retrieval Systems, December 1968.

148. Wiederkehr, R. R., "Search Characteristic Curves," Evaluation of
Document Retrieval Systems, E. Bryant (ed.), Westat Research,
Inc., December 31, 1968.

149. Wiener, N., The Human Use of Human Beings; Cybernetics and Society,
Houghton Mifflin, Boston, 1950.

150. Yule, G. H., "On Measuring Association Between Attributes," Journal
of the Royal Statistical Society, Vol. 75, 1912.

151. Zadeh, L. A., Fuzzy Sets, Electronics Research Laboratory, Report
No. 64-44, University of California, Berkeley, November 1964.

152. Zunde, P., and V. Slamecka, "Distribution of Indexing Terms for
Maximum Efficiency of Information Transmission," American
Documentation, April 1967.

153. Zipf, G. K., Human Behavior and the Principle of Least Effort,
Addison-Wesley, Cambridge, Massachusetts, 1949.






Appendix A 
GLOSSARY 






Boolean Algebra — a Boolean Algebra is defined as a distributive
lattice in which each element "a" has a complement defined by its
negation. This structure, for a defined set T and its elements
(A, B, ...), is defined in terms of the following operations.
Conjunction: C = A*B, the subset or subclass of all index terms
or elements of T that are in both subset A and subset B.
Disjunction: D = A+B, the subset of all index terms or elements of
T that are in either subset A or subset B. Negation: N = -B
or B̄, the subset of all index terms in T that are not in subset
B.
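As a minimal sketch of the three operations just defined, Python's built-in set type can stand in for subsets of index terms. The universe T and the subsets A and B below are hypothetical examples, not data from the test system, and the function names are illustrative only.

```python
# Hypothetical universe of index terms T and two subsets A and B.
T = {"ABSTRACT", "ACCESS", "CATALOG", "INDEX", "RETRIEVAL"}
A = {"ABSTRACT", "ACCESS", "INDEX"}
B = {"ACCESS", "INDEX", "RETRIEVAL"}


def conjunction(a, b):
    """C = A*B: terms that are in both A and B."""
    return a & b


def disjunction(a, b):
    """D = A+B: terms that are in A or in B (or both)."""
    return a | b


def negation(b, universe):
    """N = -B: terms of the universe that are not in B."""
    return universe - b


C = conjunction(A, B)
D = disjunction(A, B)
N = negation(B, T)
print(sorted(C), sorted(D), sorted(N))
```

Negation is taken relative to T, which is why the universe has to be passed in explicitly.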

Bradford's Law of Literary Yield or Scatter — if periodicals are
ranked into N groups, each yielding the same number of articles
on a specified topic, the number of periodicals in each group
will increase geometrically, as per: 1 : n : n^2.
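The geometric growth of the group sizes can be illustrated with a short sketch. The function name, the core-group size, and the multiplier used here are hypothetical values chosen only to show the 1 : n : n^2 pattern.

```python
def bradford_group_sizes(core_size: int, n: int, groups: int) -> list[int]:
    """Number of periodicals in each successive group, assuming every group
    yields the same number of articles and sizes grow as 1 : n : n^2 : ..."""
    if core_size < 1 or n < 1 or groups < 1:
        raise ValueError("core_size, n and groups must all be positive")
    return [core_size * n ** k for k in range(groups)]


# Hypothetical example: a core of 3 periodicals and a multiplier of 4.
sizes = bradford_group_sizes(core_size=3, n=4, groups=3)
print(sizes)
```

Dividing each entry by the core size recovers the ratio 1 : n : n^2 directly.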

Coordinate Index — an index system in which the descriptor terms
are manipulated. There are two classes of coordinate index
systems:

a) Pre-coordinate — those DRSs in which the coordination of
the descriptors takes place during the indexing process.

b) Post-coordinate — those DRSs in which the coordination of
the descriptors takes place during the inquiry generation
process.
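A post-coordinate search can be sketched as an intersection over an inverted file at inquiry time. This is an illustration under stated assumptions: the document numbers and descriptor assignments are hypothetical, and the helper names are not taken from the report.

```python
from collections import defaultdict


def build_inverted_index(assignments):
    """Map each descriptor to the set of documents indexed with it."""
    index = defaultdict(set)
    for doc, descriptors in assignments.items():
        for descriptor in descriptors:
            index[descriptor].add(doc)
    return index


def coordinate(index, descriptors):
    """Coordinate descriptors at inquiry time: documents carrying all of them."""
    postings = [index.get(d, set()) for d in descriptors]
    return set.intersection(*postings) if postings else set()


# Hypothetical descriptor assignments for three documents.
assignments = {
    1: {"ACCESS", "DATA", "LIST"},
    2: {"ACCESS", "CONTEXT"},
    3: {"DATA", "RETRIEVAL"},
}
index = build_inverted_index(assignments)
hits = coordinate(index, ["ACCESS", "DATA"])
print(sorted(hits))
```

Because the combination of descriptors is formed only when the inquiry arrives, no compound heading has to exist in the index beforehand.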

Document — any discrete unit of information — articles, reports, 
recordings, etc. 

Document Retrieval Systems — a class of information retrieval systems 
solely concerned with the subject analysis of document content, 
the storage of a set of official surrogates "defining" document 
content, and the "mechanical" search of the surrogate set to 
identify or select those documents most "relevant" to a user's 
formal request. 

Facet Index — a composite index of an item by combining in a pre- 
scribed manner the terms derived from separate relational index- 
ing examinations. 

Indexing — the process in which documents are analyzed, and terms
indicating subject content are assigned or derived.

Mandelbrot-Estoup-Zipf Relationship — the term frequency of use f(r)
versus rank (r) distribution in a language is a decreasing
convex function in log-log space, and is of the form:



f(r) = K(r+B)^(-α)
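The rank-frequency form above can be evaluated numerically with a small sketch. The exponent symbol is not legible in the scan, so it is called `alpha` here as an assumption; K, B, and the sample values are hypothetical. Setting B = 0 and alpha = 1 gives the simple Zipf form in which frequency is inversely proportional to rank.

```python
def mandelbrot_frequency(r: float, K: float, B: float, alpha: float) -> float:
    """f(r) = K * (r + B) ** (-alpha); the name 'alpha' is an assumption."""
    if r + B <= 0:
        raise ValueError("r + B must be positive")
    return K * (r + B) ** (-alpha)


def zipf_frequency(r: float, K: float) -> float:
    """Special case with B = 0 and alpha = 1, i.e. f(r) = K / r."""
    return mandelbrot_frequency(r, K, B=0.0, alpha=1.0)


# Hypothetical constants, used only to show that f(r) decreases with rank.
ranks = [1, 2, 4, 8]
zipf_values = [zipf_frequency(r, K=100.0) for r in ranks]
shifted_values = [mandelbrot_frequency(r, K=100.0, B=2.0, alpha=1.0) for r in ranks]
print(zipf_values)
print(shifted_values)
```

A positive B flattens the curve at the lowest ranks, which is the usual motivation for the shifted form.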






Uniterms, Keywords, Descriptors — words or word-pairs extracted from
a document that are used to identify the subject content of the
document.

Word Relationships — there are four operational word relationship
categories that can be employed in DRSs.

(1) Semantic relationships which manifest the meaning and
context of terms within a language,

(2) Syntactic relationships which arise from terms as members
of word classes and with the class relationships in a
structural (grammatical) sense,

(3) Syndetic relationships which measure the manner by which
words that are conjunctively coordinated with a given
or base term cross-reference one another, and

(4) Statistical relationships which measure the frequency of
occurrence of terms in a document.

Zipf "Law" of Term Usage — the relationship between the frequency of
use f(r) of a term and its rank (r) in a language, based on
Zipf's Principle of Least Effort; it is of the form:

f(r) = K r^(-1)






Appendix B 



INSTITUTE OF LIBRARY RESEARCH - TEST SYSTEM CHARACTERISTICS
o Thesaurus Listing (Sample)
o Document Descriptor Listing (Sample)






THESAURUS LISTING SAMPLE 






SUBJECT AUTHORITY LIST (98) 



Abbreviations

S = SEE

SA = SEE ALSO

SN = IN THE SENSE OF (I.E., SCOPE NOTE)

♦ = NO DOCUMENTS YET INDEXED WITH THIS TERM

♦ = TERM NOT ALLOWED; RELATED TERM TO BE USED
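One way to read the S (SEE) and SA (SEE ALSO) references in the listing that follows is as two lookup tables: S redirects a term that is not allowed to its preferred term, and SA suggests related terms. The sketch below is hypothetical code, not part of the test system; the five sample entries are taken from the listing, and the function names are assumptions.

```python
# S (SEE) references: term not allowed -> preferred term.
SEE = {
    "ALGOL": "PROG. LANGUAGE",
    "ARTICLE": "DOCUMENT",
    "CUSTOMER": "USER",
}

# SA (SEE ALSO) references: allowed term -> related terms.
SEE_ALSO = {
    "BIBLIOGRAPHY": ["ANTHOLOGY"],
    "BOOLEAN": ["LOGICAL"],
}


def preferred_term(term: str) -> str:
    """Follow S references until an allowed term is reached."""
    seen = set()
    while term in SEE and term not in seen:  # guard against circular references
        seen.add(term)
        term = SEE[term]
    return term


def expand(term: str) -> list[str]:
    """Preferred term followed by its SA (related) terms, if any."""
    preferred = preferred_term(term)
    return [preferred] + SEE_ALSO.get(preferred, [])


print(preferred_term("ALGOL"))
print(expand("BIBLIOGRAPHY"))
```

Keeping the two reference types in separate tables mirrors the distinction in the legend: S replaces a term, SA only widens the search.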



♦ABBREVIATION
ABSTRACT 
ABSTRACTING 
ACCESS 

ACCESSION NUMBER 
ACCURACY 
ACQUISITION 
ADDRESS 

ADMINISTRATION 
ALGEBRA

♦ ALGOL 

S PROG. LANGUAGE 
ALGORITHM 
ALPHABETIC 
ALPHABETIC ORDER 
ALPHANUMERIC 

* ALTERNATIVES 
AMBIGUITY 
ANALOGY 
ANALYSIS 
ANSWER 

♦ANTHOLOGY 

SA BIBLIOGRAPHY 
APPLICATION 
♦ARITHMETIC 

S MATHEMATICS 

ARRAY 
♦ARTICLE 

S DOCUMENT 
ARTIFICIAL INTEL 
ASSIGNED 
ASSOCIATION 
ASSOCIATIVE 



♦ATTRIBUTE 

S CHARACTERISTIC 
AUTHOR 

AUTHORITY LIST 

SA THESAURUS 

AUTO ABSTRACTING

AUTO. INDEXING 

AUTOMATIC 

AUTOMATION 

SA MECHANIZATION 



BATCH PROCESSING 

BIBLIOGRAPHIC 

BIBLIOGRAPHY 

SA ANTHOLOGY 
BINARY 

BOOK

BOOLEAN 

SA LOGICAL 



CALL NUMBER 
CANONICAL 

SA NORMALIZED 

CARD 

CARD CATALOG 

CATALOG 

CATALOGING 

CATEGORIES 

CENTERS 

CENTRALIZED 

CHARACTERISTIC 






CHEMICAL 
CIRCULATION
CITATION
CITATION INDEX
♦CLAIM

SA COPYRIGHT
SA PATENT
CLASSIF. SCHEMA
CLASSIFICATION
CLERICAL
♦CLUE WORD

S KEYWORD
CLUMP
CLUSTER
CO-OCCURRENCE
♦COBOL

S PROG. LANGUAGE

CODE 

SN f'DlA DESIGNATION 
CODING

SN COMPUTER CODING
COEFFICIENT
COLLECTION 
♦COLLOQUIUM 

SA CONFERENCE
SA MEETING
SA SYMPOSIUM
COMBINATIONS
♦COMIT

S PROG. LANGUAGE
COMMUNICATION
COMP LINGUISTICS
COMPARISON
COMPUTER
CONCEPT
CONCORDANCE
CONDITIONAL PROB
CONFERENCE

SA COLLOQUIUM
SA MEETING
SA SYMPOSIUM
CONNECTION
♦CONSECUTIVE

S ORDER

♦CONSOLE 

S REMOTE TERMINAL 

CONTENT 

CONTENT ANALYSIS 
CONTEXT 
CONTROL 
CONTROLLED 
CONVENTIONAL
CONVERSION
COORDINATE 
COORDINATE INDEX 

SA UNITERM SYSTEM




* COPYRIGHT 

SA CLAIM 
SA PATENT 

♦CORE 

S STORAGE 
CORRELATION
COST
COUNT
COUPLING
CRANFIELD
CRITERIA
CRITICAL

SN REVIEWING, NOT VITAL
CROSS REFERENCE
CURRENT AWARENES
CURRICULUM 
♦CUSTOMER 

S USER 



DATA 

♦DECENTRALIZATION
DECISION THEORY
DEDUCTIVE
DEGREE

DEPTH OF INDEXIN

DESCRIPTIVE

DESCRIPTOR

SA KEYWORD

SA TAG 

SA TERM 
DESIGN 

SA PLANNING 
DICTIONARY 
♦DIFFERENCE 

S COMPARISON 
♦DIGITAL COMPUTER 

S COMPUTER 
DISCRIMINANT 
♦DISPLAY

S REMOTE TERMINAL 
DISSEMINATION 
♦DISSERTATION 
DOCUMENT 

SA JOURNAL 
DOCUMENTATION 
DUAL DICTIONARY 



♦ECONOMICS 

S COST 
EDITING 
EDUCATION 
EFFECTIVENESS 

SA EFFICIENCY 







EFFICIENCY

SA EFFECTIVENESS
♦ELECTRONIC COMPUTER

S COMPUTER
♦EMPIRICAL

S EXPERIMENT
♦ENCODING

S CODING
ENTROPY
ENTRY

SN ACCESS POINT

ERROR

EVALUATION

SA TEST

SA UTILITY

SA VALUE
EXPERIMENT
EXTRACT



FACET

FACETED CLASSIF.
FACT RETRIEVAL
♦FACTOR ANALYSIS

S STAT. METHOD
FALSE DROP
FEEDBACK
FILE

SA LIST

SA STRING
FILE ORGANIZATIO
FLOW OF INFO.
FORMAT
♦FORTRAN

S PROG. LANGUAGE
FREQUENCY
FUNCTION

SN OPERATIONAL, NOT
MATHEMATICAL



GENERAL
GENERATION

SN PRODUCTION
GENERIC
♦GOAL

S OBJECTIVE
GOVERNMENT
GRAMMAR
GRAPH

SN MATHEMATICAL GRAPH
SA TABLE
GRAPHICS

SN GRAPHIC MATERIAL, E.G.
PHOTOS.



♦GROUP

S CLUMP



HARDWARE

SN COMPUTERS, MICROFILM
EQUIPMENT, ETC.
SA MECHANICAL
♦HEADINGS

S SUBJECT HEADING
HIERARCHY
HISTORICAL
♦HUMAN

S MANUAL
♦HUMAN INDEXING

S MANUAL INDEXING



♦IDENTICAL

IDENTIFICATION

ILLUSTRATION
♦IMPLEMENTATION

INDEPENDENT

INDEX 

INDEXING 

INFERENCE 

INFO. RETRIEVAL 

INFO. SCIENCE 

INFORMATION 

INPUT
♦INQUIRER 

S USER 
♦INQUIRY 

S QUESTION 
♦INSTRUCTION 

S EDUCATION 

INTELLECTUAL 

INTERDISCIPLINAR

INTERFACE

INTERPRET
♦INTERROGATE 

S QUESTION 
♦INTERSECTION 

S VENN DIAGRAM 

INTRODUCTORY

INTUITIVE 

INVENTORY 
* INVERTED

IRRELEVANT 
♦ ITEM 

S DOCUMENT 
ITERATIVE 

SA RECURSIVE






JOURNAL

SA DOCUMENT



KEYPUNCH
KEYWORD

SA DESCRIPTOR

SA TAG

SA TERM

KWIC



LANGUAGE 
LARGE 
LATTICE 
LAW
♦LEVEL 

S DEGREE 
♦LEXICAL 

S ALPHABETIC 
♦LEXICON

S DICTIONARY
LIBRARIAN
LIBRARY
LINGUISTIC
LINK
LIST

SA FILE

SA STRING
LITERATURE 
LOGIC 
LOGICAL 

SA BOOLEAN 



♦MACHINE 

S HARDWARE 
MACHINE-READABLE
♦ MAGNETIC TAPE

S STORAGE
MAN-MACHINE
MANUAL

MANUAL INDEXING
MATCH

MATHEMATICAL
MATHEMATICS

SA PROBABILITY
MATRIX
MEANING
MEASURE
MECHANICAL

SA HARDWARE
MECHANIZATION

SA AUTOMATION
MEDIUM



MEETING

SA COLLOQUIUM
SA CONFERENCE
SA SYMPOSIUM
♦MEMORY

S STORAGE

METHODOLOGY
♦ METRIC

S MEASURE

MICROFICHE

MICROFILM

MODEL 

SA SIMULATION 
MODIFICATION 
MULTIPLE 



NATIONAL 
NATURAL 

NATURAL LANGUAGE 

NEEDS 

NETWORK 

SN ORGANIZATIONAL STRUCTURE 
SA ORGANIZATION 

NOISE 
♦NOMENCLATURE 

S NOTATION 
NON-CONVENTIONAL 
NON-DISCRIMINANT
NON-FILE
NON-RANDOM
NON-RELEVANT
♦ NORMALIZED

SA CANONICAL
NOTATION 

SA TERMINOLOGY 
NUMBER 
NUMERIC 



OBJECTIVE 

SN GOAL, NOT AS OPPOSED
TO SUBJECTIVE
♦OCCURRENCE
OFF-LINE
ON-LINE
OPERATION
OPTIMIZATION
ORDER

ORGANIZATION

SA NETWORK 
OUTPUT 



♦ PAIR 

S WORD ASSOCIATION 






S DOCUMENT
SA VARIABLE

PARSE
PATENT

SA CLAIM

SA COPYRIGHT
PATTERN
PERFORMANCE
♦PERIODICAL

S JOURNAL
PERMUTED
PERTINENT

SA RELEVANT
PHILOSOPHY

SA POLICY

♦PHOTO

S GRAPHICS
PLANNING

SA DESIGN

♦PLOT

S GRAPH
♦POLICY

SA PHILOSOPHY

♦POPULATION

S COLLECTION
PRECISION
PREDICTION 
♦PRINCIPLE 

♦ PRINT-OUT

S OUTPUT

PRINTING
♦PRIVACY

S SECRECY
PROBABILITY

SA MATHEMATICS
PROCEDURE

PROCEEDINGS
PROCESSING
PROFILE
PROG. LANGUAGE
PROGRAM

SN COMPUTER PROGRAM
SA ROUTINE
SA SOFTWARE
SA SUBROUTINE
PROGRAMMED
♦PROPERTY

S CHARACTERISTIC
PSYCHOLOGY
♦PUBLICATION

S DOCUMENT
PUNCHED
♦PUNCHED-CARD

S STORAGE 



PUNCTUATION 
♦PURPOSE 

S OBJECTIVE 



QUALITATIVE 

SA SUBJECTIVE 
QUANTITATIVE 
♦QUERY 

S QUESTION 
QUESTION 

SN BOTH NOUN AND VERB
QUESTION-ANSWER



RANDOM 

RANDOM-ACCESS 
RANK 
READING 
REAL-TIME 
RECALL 
RECOGNITION 
RECORD 
♦RECORDED INFO.

S RECORD
RECURSIVE

SA ITERATIVE

REFERENCE
♦REJECTION
RELATED 
RELATIONSHIP 
RELATIVE 
RELEVANCE 
RELEVANT 

SA PERTINENT 
♦REMOTE TELETYPES 

S REMOTE TERMINAL 
REMOTE TERMINAL 

SA VISUAL DIS. CON.
♦REPORT 

S DOCUMENT 
♦REQUEST 

S QUESTION 
RESEARCH 
♦RESPONSE 

S ANSWER 
RESPONSE TIME 
RETRIEVAL 
RETRIEVAL SYSTEM 
REVIEW 

SA SUMMARY 

SA SURVEY 

ROLE 






ROUTINE

SN COMPUTER ROUTINE
SA PROGRAM
SA SOFTWARE
SA SUBROUTINE



SAMPLE 
SCANNING 
SCIENTIFIC 
SCOPE NOTE 
SEARCH CRITERIA 
SEARCH STRATEGY 
SEARCHING 

♦ SECRECY 
SEE ALSO

SN AS USED IN CATALOGING
SEE-REFERENCE
SELECTION
SELECTIVE DISSEM
SEMANTIC

SA SYNTAX
SEQUENCE

♦ SERIAL

S JOURNAL
SERVICE
SET THEORY
SETS

SHELFLIST

SIGNIFICANCE

SIMULATION

SA MODEL

SIZE
SMALL
SOCIAL IMPLIC.
SOFTWARE

SA PROGRAM
SA ROUTINE
SA SUBROUTINE
SORTING 
SOURCE 
SPECIALIZED 
SPECIFICITY 
STANDARDIZATION
STAT ASSOCIATION
STAT. ANALYSIS

SA STAT. METHOD
STAT. METHOD

SA STAT. ANALYSIS 
STATE-OF-THE-ART 
STATISTICAL 
♦ STOCHASTIC 

S RANDOM 
STORAGE 



STRING 

SA FILE 

SA LIST 
STRUCTURE 
SUBJECT 

SUBJECT HEADING 
SUBJECT INDEXING 
SUBJECT-CATALOG.
♦ SUBJECTIVE

SA QUALITATIVE
SUBROUTINE

SA PROGRAM
SA ROUTINE
SA SOFTWARE
SUMMARY

SA REVIEW
SA SURVEY
SURVEY

SA REVIEW
SA SUMMARY
SYMBOL
SYMBOLIC LOGIC
SYMPOSIUM

SA COLLOQUIUM
SA CONFERENCE
SA MEETING
SYNONYM

SYNTACTIC ANAL.

SYNTAX

SA SEMANTIC
SYSTEM



TABLE

SA GRAPH
TAG

SA DESCRIPTOR
SA KEYWORD
SA TERM

♦TAPE 

S STORAGE 
♦TEACHING 

S EDUCATION 
TECHNICAL 
TECHNICAL REPORT 
TECHNOLOGY 
TELEGRAPHIC ABS. 
TERM 

SA DESCRIPTOR 

SA KEYWORD

SA TAG
♦TERMINAL

S REMOTE TERMINAL 
TERMINOLOGY 

SA NOTATION 






TEST

SA EVALUATION
SA UTILITY 
SA VALUE 

TEXT 

THEORY 

THESAURUS 

SA AUTHORITY LIST 

TIME 

TIME-SHARING 
TITLE
♦TOPIC

S SUBJECT
TRANSFORMATION
TRANSLATION
♦TRANSLITERATION
TRANSMISSION
TREE

TREE STRUCTURE

TRUNCATION
♦TYPE STYLE

TYPE-SETTING 
♦TYPOGRAPHICAL 



MfAWJ INDEXING 
WORD

WORD ASSOCIATION
WORD FREQUENCY
♦WORD PAIRS

S WORD ASSOCIATION



♦UNION

SN SET THEORY UNION

S VENN DIAGRAM
♦UNION CATALOG
♦UNITERM

S DESCRIPTOR 
UNITERM SYSTEM 

SA COORDINATE INDEX 
UPDATING 
USER 
UTILITY 

SA EVALUATION

SA TEST 

SA VALUE 



VALIDATION 
VALUE 

SA EVALUATION

SA TEST 

SA UTILITY 
VARIABLE 

SA PARAMETER 
VECTOR 

VENN DIAGRAM 
♦ VISUAL DIS. CON.

SA REMOTE TERMINAL 
VOCABULARY 



WEIGHT 






DOCUMENT DESCRIPTOR LISTING (98) 



A013101L0ACCESS
A013102L0DATA
A013103L0LIST
A013104L0PROG. LANGUAGE
A013105L0STRING
A013106L0VARIABLE

A013201L0ACCESS
A013202L0CONTEXT
A013203L0GRAMMAR
A013204L0NATURAL LANGUAGE
A013205L0RELEVANT
A013206L0SYNTACTIC ANAL.
A013207L0TRANSFORMATION

A013301L0ABSTRACTING
A013302L0CONFERENCE
A013303L0LINGUISTIC
A013304L0PARSE
A013305L0SYMBOLIC LOGIC

A013401L0ALGORITHM
A013402L0INTERPRET
A013403L0NOISE
A013404L0REDUNDANCY
A013405L0SYSTEM

A013501L0ACCESS
A013502L0DOCUMENT
A013503L0LIBRARIAN
A013504L0RESEARCH
A013505L0TECHNOLOGY

A013601L0ACQUISITION
A013602L0LIBRARY
A013603L0RETRIEVAL

A013701L0ACCESSION NUMBER
A013702L0RETRIEVAL

B001201L0AUTO ABSTRACTING
B001202L0LINGUISTIC
B001203L0TRANSLATION

B001301L0ABSTRACTING
B001302L0DICTIONARY
B001303L0LIBRARY

B001401L0DOCUMENT
B001402L0SCANNING

B001501L0AUTOMATION
B001502L0INFO. RETRIEVAL
B001503L0QUESTION



ALGORITHM
FILE 

NOTATION 
PROGRAM 

STRUCTURE 



ALGORITHM 
DATA 

INFO. RETRIEVAL
OUTPUT
SEMANTIC
SYNTAX 



ASSIGNED 

INFORMATION 

OPERATION 

SETS 

SYNTAX 



COMMUNICATION 

FLOW OF INFO.

INFORMATION 

PARSE 

STORAGE 

SYSTEM 



COST 

LANGUAGE 
PROCEDURE 
STORAGE 
SYSTEM 



COMPUTER 

GENERATION 

INTERPRET 

QUESTION-ANSWER

SURVEY

TIME-SHARING



ALGORITHM 


ANALYSIS 


COMP LINGUISTICS 


EDITING 


EVALUATION 


INFO. RETRIEVAL 




LOGIC 


MATCH 


NATURAL LANGUAGE 


PROG. LANGUAGE


PROGRAM 


QUESTION-ANSWER




TECHNICAL


TIME-SHARING 


TRANSLATION 




COMPUTER 


CONFERENCE 


ERROR 




MAN-MACHINE


MATHEMATICAL 


NATURAL LANGUAGE 


NOTATION 


PROG. LANGUAGE 


PROGRAM 




SEMANTIC 


SOFTWARE 


SYNTAX 




TRANSLATION 


USER 


WORD




BIBLIOGRAPHY 


CENTERS 


CIRCULATION 




FLOW OF INFO. 


GENERAL 


INFO. RETRIEVAL 




LIBRARY 


MECHANIZATION 


REMOTE TERMINAL




SCIENTIFIC 


SEARCHING 


SERVICE 




TRANSMISSION 








ANALYSIS 


CIRCULATION 


COMMUNICATION 




MEASURE


MEETING 


PATTERN 




SERVICE 


SYSTEM 






BOOK


CLASSIFICATION 


LIBRARY 




SIZE


SUBJECT 






BIBLIOGRAPHIC 


COMMUNICATION 


LANGUAGE 




NATURAL 


STORAGE 


SYSTEM 




ASSOCIATION 


CLASSIFICATION 


DATA 




FREQUENCY 


INDEX 


INFORMATION 




LITERATURE 


MICROFILM 


NETWORK 




INDEXING 


INFO. RETRIEVAL 


MICROFILM 




STORAGE 


TERM 


TRANSLATION 




COMMUNICATION 


DISSEMINATION 


DOCUMENT 




INFORMATION 


INPUT 


OUTPUT 




RETRIEVAL 


SIGNIFICANCE 


THESAURUS 








Appendix C 

SAMPLE DATA BASE CHARACTERISTICS 

o Term Frequency of Use Listing

o Depth of Indexing Listing

o Term-Document Matrix in Condensed
Array Format






TERM FREQUENCY OF USE FOR SAMPLE DATA BASE 



Term  Use       Term  Use       Term  Use

                51    4         102   13
1     0         52    3         103   3
2     4         53    0         104   15
3     1         54    6         105   2
4     (illeg.)  55    20        106   1
5     0         56    1         107   0
6     0         57    5         108   32
7     (illeg.)  58    7         109   1
8     2         59    8         110   0
9     0         60    6         111   2
10    5         61    5         112   3
11    9         62    9         113   4
12    2         63    5         114   10
13    0         64    0         115   3
14    0         65    1         116   3
15    0         66    10        117   3
16    2         67    1         118   26
17    1         68    4         119   12
18    14        69    27        120   1
19    2         70    7         121   1
20    0         71    0         122   3
21    3         72    0         123   1
22    1         73    1         124   3
23    0         74    1         125   5
24    0         75    5         126   8
25    12        76    4         127   2
26    3         77    3         128   3
27    1         78    2         129   1
28    1         79    3         130   11
29    3         80    1         131   3
30    8         81    0         132   2
31    12        82    4         133   1
32    1         83    10        134   4
33    0         84    0         135   0
34    5         85    6         136   6
35    4         86    6         137   8
36    1         87    1         138   0
37    4         88    1         139   8
38    5         89    3         140   7
39    1         90    3         141   1
40    0         91    1         142   0
41    1         92    (illeg.)  143   0
42    1         93    2         144   3
43    3         94    2         145   0
44    (illeg.)  95    5         146   0
45    5         96    (illeg.)  147   15
46    2         97    2         148   26
47    0         98    0         149   3
48    2         99    6         150   21
49    2         100   1         151   4
50    1         101   1         152   22







Term  Use       Term  Use       Term  Use

153   14        204   4         255   1
154   2         205   3         256   1
155   1         206   0         257   13
156   1         207   1         258   2
157   1         208   1         259   5
158   4         209   1         260   1
159   1         210   1         261   1
160   1         211   (illeg.)  262   8
161   1         212   4         263   0
162   (illeg.)  213   1         264   0
163   1         214   1         265   15
164   4         215   1         266   2
165   0         216   0         267   20
166   6         217   1         268   12
167   4         218   3         269   4
168   13        219   3         270   8
169   1         220   1         271   0
170   5         221   5         272   21
171   0         222   1         273   14
172   2         223   5         274   0
173   8         224   0         275   3
174   3         225   2         276   0
175   2         226   0         277   1
176   3         227   2         278   2
177   7         228   7         279   3
178   3         229   3         280   7
179   1         230   (illeg.)  281   0
180   1         231   1         282   2
181   4         232   0         283   9
182   3         233   10        284   24
183   1         234   5         285   1
184   6         235   0         286   0
185   6         236   0         287   0
186   5         237   9         288   (illeg.)
187   11        238   4         289   1
188   3         239   4         290   13
189   17        240   8         291   4
190   2         241   2         292   3
191   3         242   3         293   1
192   1         243   8         294   4
193   0         244   0         295   0
194   4         245   1         296   1
195   0         246   2         297   0
196   1         247   0         298   0
197   6         248   1         299   1
198   1         249   3         300   1
199   0         250   17        301   3
200   1         251   5         302   1
201   1         252   4         303   3
202   7         253   1         304   0
203   2         254   6         305   3






Term  Use       Term  Use

306   (illeg.)  339   (illeg.)
307   (illeg.)  340   (illeg.)
308   (illeg.)  341   (illeg.)
309   (illeg.)  342   (illeg.)
310   (illeg.)  343   (illeg.)
311   (illeg.)  344   (illeg.)
312   (illeg.)  345   (illeg.)
313   (illeg.)  346   (illeg.)
314   12        347   3
315   6         348   3
316   4         349   0
317   6         350   0
318   0         351   0
319   1         352   0
320   0         353   0
321   3         354   6
322   7         355   0
323   1         356   11
324   3         357   4
325   3         358   1
326   6         359   6
327   6         360   2
328   22        361   2
329   2         362   0
330   2         363   0
331   5         364   5
332   1         365   7
333   5         366   4
334   0         367   7
335   1         368   9
336   5         369   3
337   4
338   9










DEPTH OF INDEXING DISTRIBUTION 



Document  Depth of Indexing  Document  Depth of Indexing

1         27                 52        37
2         11                 53        12
3         13                 54        14
4         14                 55        19
5         10                 56        3
6         (illeg.)           57        8
7         12                 58        (illeg.)
8         16                 59        12
9         12                 60        10
10        15                 61        10
11        16                 62        16
12        15                 63        8
13        (illeg.)           64        17
14        12                 65        12
15        15                 66        15
16        (illeg.)           67        2
17        11                 68        15
18        (illeg.)           69        23
19        11                 70        17
20        1                  71        12
21        19                 72        10
22        16                 73        14
23        24                 74        9
24        21                 75        9
25        15                 76        14
26        13                 77        10
27        14                 78        11
28        19                 79        14
29        13                 80        15
30        24                 81        16
31        18                 82        8
32        12                 83        11
33        17                 84        11
34        18                 85        14
35        14                 86        15
36        14                 87        9
37        25                 88        21
38        8                  89        14
39        10                 90        14
40        12                 91        2
41        14                 92        14
42        3                  93        20
43        (illeg.)           94        12
44        17                 95        19
45        26                 96        15
46        13                 97        19
47        (illeg.)           98        12
48        34                 99        13
49        22                 100       3
50        (illeg.)           101       16
51        10                 102       13






TERM - DOCUMENT MATRIX FOR SAMPLE DATA 
BASE - IN CONDENSED ARRAY FORM 



XXX YYZZZ

Interpret as: document XXX is assigned descriptor ZZZ.
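The records that follow can be read mechanically. In the sketch below, the second field of each record is assumed to hold a running sequence number within the document followed by a three-digit descriptor number (so "9118" is read as sequence 9, descriptor 118). That split is inferred from the listing rather than stated in it, and the function names are hypothetical. The sample records are a small subset of documents 1 and 2.

```python
from collections import Counter


def parse_field(field: str) -> tuple[int, int]:
    """Split a field into (sequence, descriptor number).

    Assumption: the last three characters are the descriptor number and the
    leading characters are the sequence number within the document.
    """
    parts = field.split()
    if len(parts) == 2:  # e.g. "1 18": the blank already separates the two
        return int(parts[0]), int(parts[1])
    return int(field[:-3]), int(field[-3:])  # e.g. "9118" -> (9, 118)


def load(records):
    """Turn (document, field) pairs into (document, sequence, descriptor)."""
    return [(doc, *parse_field(field)) for doc, field in records]


# A few records transcribed from the start of the listing (subset only).
sample = [
    (1, "1 18"), (1, "2 29"), (1, "3 30"), (1, "9118"), (1, "10120"),
    (2, "1 34"), (2, "7118"),
]
assignments = load(sample)

# Term frequency of use and depth of indexing follow by counting.
term_use = Counter(term for _, _, term in assignments)
depth = Counter(doc for doc, _, _ in assignments)
print(assignments)
print(term_use[118], depth[1], depth[2])
```

Counting over the full matrix in this way should reproduce the term frequency and depth of indexing listings earlier in this appendix, which gives a simple consistency check on a transcription.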





1 18 


1 


2 29 


1 


3 30 


1 


4 78 


1 


5 79 


I 


6 86 


1 


7 91 


1 


O Ave 


1 


9118 


1 


10120 


1 


11125 


1 


12130 


1 


13148 


I 


14149 


1 




1 
1 


1 Al 70 
AO A f V 


1 


17177 


1 


18185 


1 


19191 


1 


20262 


1 


21267 


I 


22285 


1 
& 


23290 


1 
& 


94391 


1 


25322 


1 


26338 


1 


27343 


2 


1 34 


2 


2 57 


2 


3 68 


2 


4 88 


2 


^ wV 


2 


6108 


2 


7118 


2 


8144 


2 


9248 


2 


10265 


2 


11336 


3 


1 «ft 




9 94 


3 


3 62 


3 


4 70 


3 


5 85 


3 


6118 


3 


7119 


3 


8149 


3 






lOlftS 


3 


11186 


3 


12314 


3 


13356 


4 


1 30 


4 


2 55 


4 


3 69 




» • ^ 








6119 


4 


7148 


4 


8149 


4 


9166 


4 


10265 


4 


11290 








& ^ J & 9 




14366 


5 


1 34 


5 


2 51 


5 


3108 


5 


4130 


5 


5144 


5 


6204 


5 


7227 


5 


8265 


5 


9280 


5 


10311 


6 


1 25 


6 


2 31 


6 


3104 


6 


4118 




9 & lb 9 


6 


6137 


6 


7140 


6 


8148 


6 


9153 


6 


10189 


6 


11198 


6 


12220 




13233 


h 


U257 


6 


15267 


6 


16272 


6 


17284 


6 


18311 


7 


1 86 


7 


2 97 


7 


3 99 


7 


>1U 


7 


5118 


7 


6124 


7 


7150 


7 


8189 


7 


9267 


7 


10273 


7 


11328 


7 


12359 


8 


1 32 


8 


2 55 


8 


3094 


8 


4112 


8 


5131 


8 


6168 


8 


7173 


8 


8215 


8 


9228 


8 


10231 


8 


11270 


8 


12300 


8 


13311 


8 


14322 


8 


15333 


8 


16357 


9 


1 25 


9 


2 48 


9 


3 62 


9 


4 85 


9 


5 99 


9 


6189 


9 


7249 


9 


8252 


9 


9265 


9 


10311 


9 


11331 


9 


12360 


10 


1 61 


10 


2 66 


10 


3 ¥6 


10 


4115 


10 


5117 


10 


6126 


10 


7152 


10 


8186 


10 


9189 


10 


10205 


10 


11237 


10 


12268 


10 


13254 


10 


14291 


10 


15338 


11 


1 18 


11 


2 31 


11 


3 55 


11 


4 62 


11 


5 76 


11 


6 83 


11 


7105 


11 


8108 


11 


9130 


11 


10131 


11 


II367 


11 


12311 


11 


13356 


11 


14359 


11 


15365 


11 


16367 


12 


1 21 


12 


2 31 


12 


3 55 


12 


4 57 


12 


5 58 


12 


6 59 


12 


7 62 


12 


8 95 


12 


9118 


12 


10170 


12 


1 1 187 


12 


12243 


12 


13272 


12 


14338 


12 


15339 


13 


1 2 


13 


2 11 


13 


3 19 


13 


4 69 


13 


5 93 


13 


6106 


13 


7108 


13 


8125 


13 


9130 


13 


10137 


13 


11152 


13 


12164 


13 


13177 


13 


14184 


13 


15194 


13 


16205 


13 


17237 


13 


18241 


13 


19273 


13 


20259 


13 


21296 


13 


22329 


13 


23359 


13 


24365 


13 


25366 


13 


26368 


14 


1100 


14 


2102 


14 


3108 


14 


4119 


14 


5137 


14 


6188 


14 


7189 


14 


8233 


14 


9267 


14 


10272 


14 


11280 


14 


12283 


15 


1 31 


15 


2 49 


15 


3 55 


15 


4119 


15 


5126 


15 


6139 


15 


7152 


15 


8176 


15 


9250 


15 


10253 


15 


11256 


15 


12269 


15 


13284 


15 


14328 


15 


15341 


16 


1 70 


16 


2 83 


16 


3118 


16 


4119 


16 


5175 


16 


6189 


16 


7233 


16 


8257 


16 


9267 


16 


10272 


16 


11275 


16 


12283-17 


1 10 


17 


2 38 


17 


3 55 


17 


4108 


17 


5150 


17 


6170 


17 


7185 


17 


8189 


17 


9267 


17 


10272 


17 


I12I2 


18 


1 18 


18 


2 51 


18 


3 52 


18 


4 60 


18 


5177 


18 


6230 


18 


72^5 


18 


8280 


18 


9342 


18 


10356 


19 


1 18 


19 


2 48 


19 


3 69 


19 


4115 


19 


5152 


19 


6185 


19 


7189 


19 


8197 


19 


9237 


19 


10338 


19 


11339 


20 


1 31 


20 


2 49 


20 


3 69 


20 


4153 


20 


5243 


20 


6283 


20 


7284 


21 


1 41.21 


2 54 


21 


3 60 


21 


4108 


21 


5109 


21 


6124 


21 


7130 


21 


8150 


21 


9152 


21 


10212 21 


11221 


21 


12246 


21 


13250 


21 


14252 


21 


15268 


21 


16284 


21 


17291 


21 


18312 


21 


19328 


22 


1 21 


22 


2 57 


22 


3 86 


22 


4114 


22 


S127 


22 


6140 


22 


7150 


22 


8169 


22 


9197 


22 


10219 


22 


11237 


22 


12249 


22 


13270 


22 


14284 


22 


15340 


22 


16347023 


1 11 


23 


2 18 


23 


3 31 


23 


4 55 


23 


5 57 


23 


6 58 


23 


7 69 


23 


8104 


23 


9108 


23 


10118 


23 


1*148 


23 


12152 


23 


13166 


23 


14168 


23 


15202 


23 


16223 


23 


17240 


23 


18242 




23 


19267 




20272 


23 


21273 


23 


9 9 9 1 #\ 

22310 


23 


23311 


23 


24321 


9 «* 

32 


25326 


23 


26339 


23 


27341 


23 


28344 


24 


1 26 


24 


2 .82 


24 


3 86 


24 


4 95 


24 


5114 


24 


6118 


2^. 


7119 


24 


8147 


24 


9150 


24 


10152 


24 


11184 


24 


12197 


24 


132 3 


24 


14228 


24 


15233 


24 


16257 


24 


17283 


24 


18284 


24 


19328 


24 


20356 


24 


21357 


25 


1 25 


25 


2 30 


25" 


^3 69 


25 


4108 


25 


5111 


25 


6118 


25 


7148 


25 


8153 


25 


9189 


25 


1026B 


25 


11272 


25 


12257 


25 


13284 


25 


14328 


25 


15365 


26 


1 25 


26 


2 59 


26 


3130 


26 


4137 


26 


5147 


26 


6185 


26 


7189 


26 


8233 


26 


9254 


26 


10268 


26 


11272 


26 


12311 


26 


13339 


27 


1 4 


27 


2 79 


27 


3104 


27 


4132 


27 


5136 


27 


6150 


27 


7174 27 


8202 


27 


9260 


27 .10328 


27 


11338 


27 


12343 


27 


13356 


27 


14357 


28 


1 10 


28 


2 28 


28 


3 66 


28 


4137 


28 


5152 


28 


6186 


28 


7187 


28 


6204 


28 


9265 


28 


10293 


28 


11314 


28 


12331 


28 


13338 


29 


1 4 


29 


2 66 


29 


3 69 


29 


4126 


29 


5158 


29 


6218 


29 


7219 


29 


8243 


29 


9269 


29 


10292 


29 


11328 


29 


12341 


29 


13357 


30 


1 4 


30 


2 11 


30 


3 66 


30 


4 69 


30 


5 75 


30 


6128 


30 


7133 


30 


8136 


30 


9150 


30 


10152 


30 


III57 


30 


12202 


30 


13223 


30 


14225 


30 


15251 


30 


16268 


30 


17290 


30 


18312 


30 


19321 


30 


20326 


30 


21327 


30 


22328 


30 


23341 


30 


24343 


31 


1 4 


31 


2 35 


31 


3 46 


31 


4 50 


31 


5108 


31 


6128 


31 


7132 


31 


8150 


31 


9172 


31 


10173 


31 


11191 


31 


12269 


31 


13270 


31 


14280 


31 


15284 


31 


16292 


31 


17333 


31 


18346 


32 


1 3 


32 


2 25 


32 


3 55 


32 


4 9t> 


32 


5104 


32 


6130 


32 


7147 


32 


8152 


32 


9173 


32 


10177 


32 


11196 


32 


12204 


33 


1 18 


33 


2 38 


33 


3 55 


33 


4 61 


33 


5 75 


33 


6148 


33 


7152 


33 


8153 


33 


9168 


33 


10223 


33 


11250 


33 


12265 


33 


13275 


33 


14284 


33 


15280 


33 


16314 


33 


17322 


34 


1 18 


34 


2 31 


34 


3 61 


34 


4 63 


34 


5 66 


34 


6104 


34 


7108 


34 


8114 


34 


9147 


34 


I0I5I 


34 


11177 


34 


I2I9I 


34 


13238 


34 


14268 


34 


15279 


34 


16284 


34 


I73II 


34 


18343 


35 


1 55 


35 


2 60 


35 


3 66 


35 


4108 


35 


5147 


35 


6152 


35 


7153 


35 


8176 


35 


9223 


35 


10272 


35 


11284 


35 


12312 


35 


13315 


35 


14328 


36 


1 82 


36 


2 96 


36 


3116 


36 


4134 


36 


5139 


36 


6140 


36 


7147 


36 


8150 


36 


9170 


36 


10187 


36 


11221


36 


12265 


36 


13267 


36 


14272 


36 


15279


36 


16290 


36 


17312 


36 


18314 


36 


19328 


37 


1 12 


37 


2 25 


37 


3 35 


37 


4 37 


37 


5 60 


37 


6 63 


37 


7 80 


37 


8 83 


37 


9102 


37 


10108 


37 


11148 


37 


12164 


37 


13176 


37 


14177 


37 


15182 


37 


16187 


37 


17188 


37 


18190 


37 


19262 


37 


20272 


37 


21312 


37 


22322 


37 


23329 


37 


24339 


37 


25354 


38 


1 4 


38 


2 61 


38 


3102 


38 


4239 


38 


5272 


38 


6284 


38 


7314 


38 


8328 


39 


1 90 


39 


2102 


39 


3108 


39 


4111 


39 


5223 


39 


6234 


39 


7239 


39 


8250 


39 


9284 


39 


10328 


40 


1 25 


40 


2 69 


40 


3 82 


40 


4130 


40 


5147 


40 


6184 


40 


7250 


40 


8266 


40 


9267 


40 


10272 


40 


11311


40 


12325 


41 


1 45 


41 


2 55 


41 


3 63 


41 


4117 


41 


5159 


41 


6186 


41 


7234 


41 


8237 


41 


9252 


41 


10278 


41 


11309 


41 


12311 


41 


13358 


41 


14359 


42 


1 34 


42 


2 52 


42 


3152 


43 


1 35 


43 


2 51 


43 


3 52 


43


4118 


43 


5153 


43 


6177 


43 


7262 


43 


8309 


44 


1 2 


44 


2 18 


44 


3 45 


44 


4 54 


44 


5 55 


44 


6 76 


44 


7 85 


44 


8108 


44 


9147 


44 


10148 


44 


11166 


44 


12187 


44 


13188 


44 


14234 


44 


15237 


44 


16265 


44 


17330 


45 


1 4 


45 


2 70 


45 


3 75


45 


4 77 


45 


5102 


45 


6108 


45 


7122 


45 


8134 


45 


9147 


45 


10150 


45 


11167


45 


12168 


45 


13221 


45 


14250 


45 


15257 


45 


16262 


45 


17267 


45 


18284 


45 


19294 


45 


20305 


45 


21315 


45 


22327 


45 


23328 


45 


24360 


45 


25364 


45 


26366 


46 


1 16 


46 


2 18 


46 


3 55 


46 


4 99 


46 


5136 


46 


6223 


46 


7234 


46 


8239 


46 


9312 


46 


10314 


46 


11322 


46 


12327 


46 


13348 


47 


1 11


47 


2 18 


47 


3 77 


47 


4 82 


47 


5136 


47 


6137 


47 


7158 


47 


8168 


47 


9178 


47 


10225 


47 


11251


47 


12290 


47 


13314 


47 


14326 


47 


15347 


47 


16367 


48 


1 55 


48 


2 58


48 


3 61 


48 


4 92 


48 


5114 


48 


6118 


48 


7139 


48 


8147 


48 


9148 


48 


10189 


48 


11194


48 


12201 


48 


13205 


48 


14209 


48 


15210 


48


16238 


48 


17250 


48 


18255 


48 


19262 


48 


20266 


48 


21267 


48 


22268 


48 


23272 


48 


24278 


48 


25279 


48 


26282 


48 


27294 


48 


28309 


48 


29316 


48 30320 


48 


31336 


48 


32337 


48 


33339 


48 


34354 


49 


1 16 


49 


2 26 


49 


3 31 


49 


4 59 


49 


5 62 


49 


6 63 


49 


7108 


49 


8139 


49 


9147 


49 


10174 


49 


11185


49 


12189 


49 


13202 


49 


14204 


49 


15258 


49 


16267 


49 


17290 


49 


18311


49 


19338 


49 


20344 


49 


21367 


49 


22368 


50 


1136 


50 


2139 


50 


3148 


50 


4152 


50 


5153 


50 


6184 


50 


7237 


50 


8239 


50 


9250 


50 


10261 


50 


11265 


50 


12267 


50 


13284 


50 


14301 


50 


15307 


50 


16312 


50 


17324 


50 


18365 


51 


1 2 


51 


2 30 


51 


3 51 


51 


4 59 


51 


5 69 


51 


6102 


51 


7118 


51 


8119 


51 


9307 






51 


10342 


52 


1 4 


52 


2 27 


52 


3 35 


52 


4 37 52 


5 39 


52 


4 42 


52


7 43 


52 


II 44 


52 


9 40 52 


10 48 


52 


11 49 


52 


12 70 


52 


13105 


52 


14114 


52 


19124 


52 


16 UO 


52 


17141 


52 


18154 


52 


19140 


52 


20173 


52 


21207 


52 


22208 


52


23213 


52 


24221 


52 


25223 


52 


24244 


52 


27252 


52 


28259 


52 


29242 


52 


30272 


52


31303 


52 


32312 


52 


33313 


52 


34314 


52 


35322 


52 


34342 


52 


37347 


53 


1 34 


53


2 49 


53 


3 95 


53 


4139 


53 


5240 


53 


4242 


53 


7243 


53 


8299 


53 


9302 


53


10312 


53 


11319 


53 


12331 


54 


1 4 


54 


2 17 


54 


3 44 


54 


4112 


54 


9190 


54


4192 


54 


72S4 


54 


8245 


54 


9282 


54 


10312 


54 


11324 


54 


12328 


54 


13339 


54


14340 


55 


1 54 


55 


2 70 


55 


3108 


55 


4118 


55 


5122 


55 


6130 


55 


7148 


55


8190 


55 


9173 


55 


10229 


55 


11242 


55 


12257 


55 


13247 


55 


14248 


55 


19270 


55


14280 


55 


17303 


55 


18342 


55 


19348 


54 


1 8 


54 


2 34 


54 


3 95 


54 


3147 


94 


9174 


54 


4184 


54 


7244 


54 


8253 


54 


9279 


54 


10284 


54 


11312 


54 


12314 


94 


13343 


56 


U34|. 




t 1 11 


57 


2101 


57 


3148 


57 


4184 


57 


5279 


57 


4312 


97 




57 




58 


1 i • 


58 


2 24 


58 


3 40 


58 


4 49 


58 


5148 


58 


4243 


98 


■Sy'i. 


58 


113 14] 


58. 


i 9324 


58 


10344 


59 


1 11 


59 


2 55 


59 


3 57 


99 


4 99 


99 


5f 


4 4« 59 


7 83 


59 


8118 


59 


9119 


59 


10152 


59 


11170 


59 


12187 


49 


U>72 


5f 


14314* ^o 


1 11 


40 


2 99 40 


3137 


40 


4187 


40 


5221 


40 


4238 40 


7290 


40 


•272 


40 


9314 


40 


10328 


41 


1 47 


41 


2 49 


41 


3104 


41 


4134 41 


9148 


61 


4190 


%1 


7270 


41 


8290 41 


9327 41 


10344 


42 


1 45 


42 


2 94 42 


3 99 


42 


4 85 


42 


5 87 


42 


4108 


42 


7114 


42 


8130 42 


9148 


42 


10183 42 


11202 


62 


12234 


$2 


13305 


42 


14309 42 


15330 42 


14341 


43 


i 30 


43 


2118 


43 


3119 


43 


4148 


43 


5130 43 


4272 


43 


7345 


43 


8348 


44 


1 37 


44 


2 43 


44 


3 99 


64 


4102 


44 


5124 


44 


4148 


44 


7150 44 


8153 


44 


9144 


44 


10173 44 


11184 


44 


12223 


44 


13238 


44 


14259 


44 


15284 


44 


14312 


44 


17331 


49 


1114 


49 


2118 


65 


3150 


45 


4152 


45 


5172 


45 


4173 


45 


7250 45 


8242 


45 


9292 


49 


10334 


45 


11354 


45 


12359 


44 


1 85 


44 


2104 44 


3148 


44 


4150 


44 


9148 


44 


4182 


66 


7314 




8317 


44 


9325 


44 


10328 


44 


11332 


44 


12339 


44 


13342 


44 


14347 


44 


15348 


47 


1 12 


47 


2 97 


44 


3104 


47 


4131 


47 


5134 47 


4199 


47 


7174 


67 


8225 


47 


9280 


47 


10290 47 


11314 


47 


12324 47 


13328 


47 


14338 


47 


19343 


47 


14348 


47 


17348 


48 


1 18 


48 


2 19 


48 


3 44 


48 


4 49 


48 


9 99 


48 


4104 


6t 


7144 


48 


8153 


48 


9144 48 


10181 


48 


11202 


48 


12243 


48 


13314 


vS8 


14328 


4t 


15348 


49 


1 25 


49 


2 55 


49 


3 73 


49 


4 83 


49 


5104 


49 


4108 


49 


7114 


6f 


6134 


$9 


9140 


49 


10148 


49 


11150 49 


12152 


49 


13158 


49 


14144 49 


19148 


4f 


14242 


49 


17283 


49 


18288 


49 


19317 


49 


20322 


49 


21339 


49 


22342 


49 


23347 


70 


1 30 


70 


2 74 


70 


3102 


70 


4108 


70 


5118 


70 


4158 


70 


7184 


70 


8187 


70 


9237 


70 


10251 


70 


11257 


70 


12247 


70 


13290 


70 


14311 


70 


19339 


70 


14344 


70 


17348 


71 


1 29 


71 


2 74 


71 


3104 


71 


4108 


71 


5118 


71 


4119 


71 


7142 


71 


8251 


71 


9248 


71 


10331 


71 


11359 


71 


12349 


72 


1 99 


72 


2 94 


72 


3112 


72 


3118 


72 


4119 


72 


5194 


72 


4228 


72 


7247 


72 


8270 


72 


9273 


72 


10334 


73 


1 45 


73 


2 92 


73 


3108 


73 


4114 73 


5147 


73 


4148 


73 


7173 


73 


8229 


73 


9257 


73 


10247 


73 


11273 


73 


12284 


73 


13354 


73 


14354 


74 


1 48 


74 


2102 


7* 


3108 


74 


4118 


74 


5148 


74 


4147 


74 


7189 


74 


8247 


74 


9319 


79 


J, 63 


7$ 


2 83 


75 


3108 


75 


4113 


75 


5114 


75 


4118 


75 


7142 


75 


8189 


79 


9228 


7$ 


11250 


75 


12254 


75 


13257 


75 


14248 


75 


15272 


75 


14273 


79 


17284 


74 


1 29 


74 


2 58 


74 


3104 


74 


4137 


74 


5148 


74 


4187 


74 


7189 


74 


8249 


74 


9290 


74 


10314 


74 


11329 


74 


12328 


74 


13344 


74 


14348 


77 


1 11 


77 


2 49 


77 


3147 


77 


4202 


77 


5229 


77 


4272 


77 


7314 


77 


8314 


77 


9317 


77 


10344 


78 


1192 


7t 


2154 


78 


3143 


78 


4203 


78 


5218 


78 


4250 


78 


7294 


78 


8273 


78 


9283 


7t 


10284 


78 


11354 


79 


1 18 


79 


2 49 


79 


3104 


79 


4123 


79 


9129 


79 


4202 


7» 


7217 


79 


8243 


79 


9251 


79 


10290 


79 


11324 


79 


12341 


7V 


13344 


79 


1439* 


•0 


1 44 


80 


2 44 


80 


3 84 


80 


4113 


80 


5151 


80 


4152 


80 


7148 80 


8192 


to 


9197 


80 


10228 


80 


11240 80 


12259 


80 


13279 80 


14273 


80*19333 


81 


1 29 


•1 


2 31 


81 


3 45 


81 


4 58 


81 


5 42 


81 


4 7«_U, 


7108 


81 


8140 81 


9187 


• 1 


10272 


81 


11290 81 


12315 


81 


13324 81 


14348 81 


15348 


81 


14349 


82 


1 40 


•2 


2 44 


82 


3128 


82 


4151 


82 


5152 82 


4303 


82 


7338 


82 


8394 


83 


1 37 


•3 


2 49 


83 


3147 


83 


4148 


83 


5148 


83 


4181 


83 


7182 


83 


8240 


83 


9317 


•3 


10328 


83 


11337 


84 


1 10 84 


2 38 


84 


3 49 


84 


4 83 


84 


9179 84 


4219 






«4 


7243 


8« 


82S0 


84 


926S 84 


10273 84 


11284 


85 


1 25 


85 


2 59 


85 


3 60 


•S 


4113 


8S 


S118 


8S 


6189 8S 


7233 85 


8250 


85 


9257 


85 


10283 


85 


11284 


US 


12290 


8S 


13311 


8S 


14339 86 


1 89 86 


2108 


86 


3118 


86 


4148 


86 


5168 


th 


4228 


8« 


7233 


86 


82S4 86 


9257 86 


10267 


86 


11273 


86 


12305 


86 


13354 


•ft 


14344 


86 


1S366 


87 


1 10 87 


2 38 87 


3 83 


87 


4103 


87 


5178 


8/ 


. 6197 


•7 


7273 


87 


8294 


87 


9338 88 


1 11 88 


2 21 


88 


3 31 


88 


4 69 


8'8 


5103 




4112 


•8 


7139 


88 


81S1 88 


9153 88 


10155 


88 


11218 


88 


12227 


88 


13240 




142S8 


88 


1S270 


88 


16301 88 


17312 88 


18326 


88 


19328 


88 


20333 


88 


21341 


89


1 34 


89 


2 43 


•9 


3 S6 89 


4 69 89 


5116 


89 


6126 


89 


7129 


89 


8131 


•« 


91S3 


89 


102S9 


89 


11272 89 


12291 89 


13306 


89 


14314 


90 


1 18 


90 


2 29 


90 


3 30 


90 


4 31 


90 


S 69 90 


6104 90 


7108 


90 


8166 


90 


9167 


90 


10180 


90 


11240 


90 


12311 


90 


13339 90 


14369 91 


1 18 


91 


2 22 


91


3 68 


91 


4137 


91 


S147 


91 


6212 


91 


7237 91 


8384 91 


9309 


91 


10314 


92 


1 30 


92 


2 69 


92 


3102 


92 


4108 


92 


S139 92 


6148 92 


7153 


92 


8250 


92 


9268 


92 


10272 


92 


11273 


92 


12277 


92 


13284 92 


14317 93 


1 10 


93 


2 38 


93 


3 62 


93 


4 69 


93 


S102 


93 


6126 


93 


7127 93 


8150 93 


9161 


93 


10178 


93 


ii:i8i 


93 


12212 


93 


13214 


93 


142S0 


93 


1S1S4 93 


16265 93 


17284 


93 


18291 


93 


19294 


93 


20365 


94 


1 31 


94 


2 89 


94 


3 90 94 


4108 94 


5113 


94 


6233 


94 


7254 


94 


8257 


94 


9248 


94 


10273 


94 


113S6 94 


12361 95 


1 58 


95 


2 59 


95 


3 62 


95 


4 70 


9S 


S 8S 


9S 


6104 


9S 


7108 9S 


8140 95 


9181 


95 


10197 


95 


11240 


95 


12273 


9S 


13283 


9S 


14290 9S 


1S311 9S 


16326 95 


17337 


95 


18347 


95 


19368 


96 


1 83 


96


2108 


96


3117 


96 


4118 96 


5148 96 


6153 


96 


7i2a 


96


8233 


96 


9237 


96


102S0 


96 


112S7 96 


12268 96 


13273 96 


14283 


96 


15284 


97 


I f-4 


97 


2 55 


97 


3 83 


97 


4 89 


97 


S102 97 


6121 97 


7122 


97 


8124 


97


9148 


97 


10166 


97 


1117S 


97 


12233 


97 


132S'' 97 


14275 97 


15280 


97 


16317 


97 


17328 


97 


18333 


97 


193S4 


98 


1 2 


98 


2 68 98 


3 75 98 


4103 


98 


5118 


98 


6119 


98 


7189 


98 


8194 


98 


9197 


98 


1024S 98 


11249 93 


12336 


99 


1 45 


99 


2 55 


99 


3 70 


99 


4134 


99 


S148 


99 


61S0 99 


7174 99 


8212 


99 


9222 


99 


10265 


99 


11290 


99 


1231S 


99 


13327100 


1 2S100 


2 55100 


3 57100 


3148100 


5167100 


6181 


100 


7187100 


8272100 


9273100 


10310100 


11327100 


12368101 


1 54101 


2 69 


101 


3 77101 


4 78101 


S 79101 


6126101 


7148101 


8150101 


9167101 


10240 


101 


11267101 


12284101 


1333S101 


14337101 


15339l01i-l6364l02 


1 69102 


2 93 


102 


312S102 


41S0102 


51S3102 


6200102 


7241102 


8250102 


9269102 


10289 


102 


11328102 


12356102 


13365102 


14 0102 


15 0102 


16 0102 


17 0102 


18 0 









Appendix D 



ILLUSTRATIONS OF COMPUTATIONS TO 
ESTIMATE RETRIEVAL QUANTITY 






Question 1. 



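Form: $T_1 \cdot T_2 \cdot (T_3 + T_4 + T_5) - T_6$

Term Frequencies:
$f(T_1) = 21$
$f(T_2) = 11$
$f(T_3) = 20$
$f(T_4) = 53$
$f(T_5) = 44$
$f(T_6) = 2$

$f(T_1') = f(T_1 \cdot T_2) = \frac{(4.7)(21 \cdot 11)}{400} = 2.7$

$f(T_2') = f(T_3 + T_4) = 20 + 53 - \frac{(3.5)(20 \cdot 53)}{400}$

$f(T_3') = f(T_2' + T_5) = 62 + 44 - \frac{(3.75)(62 \cdot 44)}{400} = 80$

$f(T_4') = f(T_1' \cdot T_3') = \frac{(4)(2.7 \cdot 80)}{400} = 2$

$f(T_5') = f(T_4' \cdot T_6) = \frac{(100)(2 \cdot 2)}{400} = 1$

$R_q = \begin{cases} 2 & \text{if } f(T_5') = 0 \\ 1 & \text{if } f(T_5') = 1 \end{cases}$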



NOTE: All y's from Fig. 5.43. 
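The steps in Question 1 appear to follow two recurring patterns, one for a conjunction and one for a disjunction. A hedged summary is given below. The symbols $N$ and $\gamma$ are labels chosen for this sketch and are not taken from the appendix: $N$ stands for the denominator 400 used in every step, and $\gamma$ stands for the multiplier that the note above attributes to Fig. 5.43.

```latex
\begin{align*}
f(A \cdot B) &\approx \frac{\gamma \, f(A)\, f(B)}{N} \\
f(A + B)     &\approx f(A) + f(B) - \frac{\gamma \, f(A)\, f(B)}{N}
\end{align*}
```

For example, the first step of Question 1 is the conjunction pattern with $\gamma = 4.7$, $f(A) = 21$, $f(B) = 11$ and $N = 400$, which gives $1085.7 / 400 \approx 2.7$.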






Question 8. 

Form: $T_1 \cdot (T_2 + T_3)$

Term Frequencies:
$f(T_1) = 20$
$f(T_2) = 10$
$f(T_3) = 84$

$f(T_1') = f(T_2 + T_3) = 10 + 84 - \frac{(2.8)(10 \cdot 84)}{400} = 88$

$f(T_2') = f(T_1 \cdot T_1') = \frac{(3.4)(20 \cdot 88)}{400} = 15$

$R_q = 15$
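The arithmetic of Question 8 can be checked with a few lines of code. The following is a minimal sketch, not the report's own program. The function names, the parameter `n_docs` (taken to be the denominator 400 used in the computations above), and the name `gamma` for the multiplier read from the figure are all assumptions introduced here for illustration.

```python
def and_estimate(f_a, f_b, gamma, n_docs=400):
    """Estimated frequency of the conjunction A . B (assumed form: gamma * f(A) * f(B) / N)."""
    return gamma * f_a * f_b / n_docs


def or_estimate(f_a, f_b, gamma, n_docs=400):
    """Estimated frequency of the disjunction A + B (sum minus the estimated overlap)."""
    return f_a + f_b - and_estimate(f_a, f_b, gamma, n_docs)


# Question 8: T1 . (T2 + T3), with f(T1) = 20, f(T2) = 10, f(T3) = 84.
# The multipliers 2.8 and 3.4 are the values quoted in the worked example.
f_or = round(or_estimate(10, 84, 2.8))   # inner disjunction T2 + T3
rq = round(and_estimate(20, f_or, 3.4))  # outer conjunction with T1

print(f_or, rq)
```

Each intermediate result is rounded to a whole number before it is carried into the next step, which mirrors how the worked example carries 88 forward.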






Question 14. 



Form: $T_1 \cdot T_2$

Term Frequencies:
$f(T_1) = 38$
$f(T_2) = 31$



